Abstract

This paper studies sequence prediction based on the monotone Kolmogorov
complexity $Km = -\log m$, i.e. based on universal deterministic/one-part MDL.
$m$ is extremely close to Solomonoff's universal prior $M$, the latter being an
excellent predictor in deterministic as well as probabilistic environments, where
performance is measured in terms of convergence of posteriors or losses.
Despite this closeness to $M$, it is difficult to assess the prediction quality of $m$,
since little is known about the closeness of their posteriors, which are the
important quantities for prediction. We show that for deterministic computable
environments, the "posterior" and losses of $m$ converge, but rapid convergence
could only be shown on-sequence; the off-sequence convergence can be slow.
In probabilistic environments, neither the posterior nor the losses converge,
in general.

Keywords

Sequence prediction; Algorithmic Information Theory; Solomonoff's prior;
Monotone Kolmogorov Complexity; Minimal Description Length; Convergence;
Self-Optimization.



* Part of this work appeared in the proceedings of the 2003 COLT conference [Hut03b].
1 Introduction

In this work we study the performance of Occam's razor based sequence predictors. 
Given a data sequence $x_1, x_2, ..., x_{n-1}$ we want to predict (certain characteristics)
of the next data item $x_n$. Every $x_t$ is an element of some domain $\mathcal{X}$, for instance
weather data or stock-market data at time $t$, or the $t$-th digit of $\pi$. Occam's razor
[LV97], appropriately interpreted, tells us to search for the simplest explanation
(model) of our data $x_1,...,x_{n-1}$ and to use this model for predicting $x_n$. Simplicity,
or more precisely, effective complexity can be measured by the length of the shortest
program computing sequence $x := x_1...x_{n-1}$. This length is called the algorithmic
information content of $x$, which we denote by $\tilde K(x)$. $\tilde K$ stands for one of the many
variants of "Kolmogorov" complexity (plain, prefix, monotone, ...) or for $-\log \tilde k(x)$
of universal distributions/measures $\tilde k(x)$.

Algorithmic information theory mainly considers binary sequences. For finite 
alphabet $\mathcal{X}$ one could code each $x_t \in \mathcal{X}$ as a binary string of length
$\lceil \log|\mathcal{X}| \rceil$, but this would not simplify the analysis in this work. The reason is
that binary coding would not reduce the setting to bit-by-bit prediction, but to
predicting a block of bits before observing the true block of bits. The only difference
in the analysis of general alphabet versus binary block-prediction is in the convention
of how the length of a string is defined.

The most well-studied complexity regarding its predictive properties is $KM(x) =
-\log M(x)$, where $M(x)$ is Solomonoff's [Sol64, Eq.(7)] universal prior. Solomonoff
has shown that the posterior $M(x_t|x_1...x_{t-1})$ rapidly converges to the true data
generating distribution [Sol78]. In [Hut01b, Hut03a] it has been shown that $M$ is
also an excellent predictor from a decision-theoretic point of view, where the goal is
to minimize loss. In any case, for prediction, the posterior $M(x_t|x_1...x_{t-1})$, rather
than the prior $M(x_1...x_t)$, is the more important quantity.

Most complexities $\tilde K$ coincide within an additive logarithmic term, which implies
that their "priors" $\tilde k = 2^{-\tilde K}$ are close within polynomial accuracy. Some of them are
extremely close to each other. Many papers deal with the proximity of various
complexity measures [Lev73a, Gac83, ...]. Closeness of two complexity measures is
regarded as indication that the quality of their prediction is similarly good [LV97,
p. 334]. On the other hand, besides $M$, little is really known about the closeness of
"posteriors", relevant for prediction.

Aim and conclusion. The main aim of this work is to study the predictive prop- 
erties of complexity measures other than KM. The monotone complexity Km is, in 
a sense, closest to Solomonoff complexity KM. While KM is defined via a mixture 
of infinitely many programs, the conceptually simpler Km approximates KM by 
the contribution of the single shortest program. This is also closer to the spirit of 
Occam's razor. Km is a universal deterministic/one-part version of the popular Min- 
imal Description Length (MDL) principle. We mainly concentrate on Km because 
it has a direct interpretation as a universal deterministic/one-part MDL predictor, 
and it is closest to the excellently performing $KM$, so we expect predictions based on
other $\tilde K$ not to be better.

The main conclusion we will draw is that closeness of priors necessarily implies
neither closeness of posteriors nor good performance from a decision-theoretic
perspective. It is far from obvious whether $Km$ is a good predictor in general, and
indeed we show that Km can fail (with probability strictly greater than zero) in the 
presence of noise, as opposed to KM. We do not suggest that Km fails for sequences 
occurring in practice. It is not implausible that (from a practical point of view) mi- 
nor extra (apart from complexity) assumptions on the environment or loss function 
are sufficient to prove good performance of $Km$. Some complexity measures, like the
prefix complexity $K$, fail completely for prediction.

Contents. Section 2 introduces notation and describes how prediction performance
is measured in terms of convergence of posteriors or losses. Section 3 summarizes
known predictive properties of Solomonoff's prior $M$. Section 4 introduces the
monotone complexity $Km$ and the prefix complexity $K$ and describes how they and other
complexity measures can be used for prediction. In Section 5 we enumerate and
relate eight important properties, which general predictive functions may possess or
not: proximity to $M$, universality, monotonicity, being a semimeasure, the chain
rule, enumerability, convergence, and self-optimization. Some later needed
normalization issues are also discussed. Furthermore, convergence of non-semimeasures
that are close to $M$ is proven. Section 6 contains our main results. Monotone
complexity $Km$ is analyzed quantitatively w.r.t. the eight predictive properties.
Qualitatively, for deterministic, computable environments, the posterior converges and
is self-optimizing, but rapid convergence could only be shown on-sequence; the (for
prediction equally important) off-sequence convergence can be slow. In probabilistic
environments, $m$ neither converges, nor is it self-optimizing, in general. Section 7
presents some further results: Poor predictive performance of the prefix complexity
$K$ is shown and a simpler MDL-inspired way of using $Km$ for prediction is briefly
discussed. Section 8 contains an outlook and a list of open questions, including the
convergence speed of $m$, natural Turing machines, non-self-optimization for
general Turing machines and losses, other complexity measures, two-part MDL, extra
conditions on environments, and other generalizations.
2 Notation and Setup



Strings and natural numbers. We write $\mathcal{X}^*$ for the set of finite strings over
finite alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the set of infinite sequences. We use letters $i,t,n$ for
natural numbers, $x,y,z$ for finite strings, $\epsilon$ for the empty string, $\ell(x)$ for the length
of string $x$, and $\omega = x_{1:\infty}$ for infinite sequences. We write $xy$ for the concatenation of
string $x$ with $y$. For a string of length $n$ we write $x_1x_2...x_n$ with $x_t \in \mathcal{X}$ and further
abbreviate $x_{1:n} := x_1x_2...x_{n-1}x_n$ and $x_{<n} := x_1...x_{n-1}$. For a given sequence $x_{1:\infty}$ we
say that $x_t$ is on-sequence and $\bar x_t \ne x_t$ is off-sequence. $x'_t$ may be on- or off-sequence.

Prefix sets/codes. String $x$ is called a (proper) prefix of $y$ if there is a $z$ ($\ne\epsilon$) such
that $xz = y$. We write $x* = y$ in this case, where $*$ is a wildcard for a string, and
similarly for infinite sequences. A set of strings is called prefix-free if no element is
a proper prefix of another. A prefix-free set $\mathcal{P}$ is also called a prefix code. Prefix
codes have the important property of satisfying Kraft's inequality
$\sum_{x\in\mathcal{P}} |\mathcal{X}|^{-\ell(x)} \le 1$.
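
As an illustration of the prefix-free property and of Kraft's inequality, the following minimal Python sketch checks both for a small binary code. The function names and the example code words are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction


def is_prefix_free(codewords):
    """Return True if no code word is a proper prefix of another one."""
    words = list(codewords)
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )


def kraft_sum(codewords, alphabet_size=2):
    """Exact value of sum_x |X|^(-length(x)) over the given code words."""
    return sum(Fraction(1, alphabet_size ** len(w)) for w in codewords)


# Hypothetical example code over the binary alphabet.
code = ["0", "10", "110", "111"]
print(is_prefix_free(code), kraft_sum(code))
```

Exact rational arithmetic is used so that the comparison with 1 is not blurred by floating-point rounding.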

Asymptotic notation. We abbreviate $\lim_{t\to\infty}[f(t) - g(t)] = 0$ by
$f(t) \stackrel{t\to\infty}{\longrightarrow} g(t)$ and say $f$ converges to $g$, without implying that
$\lim_{t\to\infty} g(t)$ itself exists. The big O-notation $f(x) = O(g(x))$ means that there are
constants $c$ and $x_0 > 0$ such that $|f(x)| \le c|g(x)|$ $\forall x > x_0$. The small o-notation
$f(x) = o(g(x))$ abbreviates $\lim_{x\to\infty} f(x)/g(x) = 0$. We write $f(x) \stackrel{\times}{\le} g(x)$ for
$f(x) = O(g(x))$ and $f(x) \stackrel{+}{\le} g(x)$ for $f(x) \le g(x) + O(1)$. Corresponding equalities
can be defined similarly. They hold if the corresponding inequalities hold in both
directions. $\sum_{t=1}^\infty a_t^2 < \infty$ implies $a_t \to 0$ for $t\to\infty$. We say that $a_t$ converges fast or
rapidly to zero if $\sum_{t=1}^\infty a_t^2 \le c$, where $c$ is a constant of reasonable size; $c = 100$ is
reasonable, maybe even $c = 2^{30}$, but $c = 2^{500}$ is not.^1 The number of times for which
$a_t$ deviates from 0 by more than $\varepsilon$ is finite and bounded by $c/\varepsilon^2$; no statement is
possible for which $t$ these deviations occur. The cardinality of a set $\mathcal{S}$ is denoted by
$|\mathcal{S}|$ or $\#\mathcal{S}$. For properties $A(t) \in \{true, false\}$ we say

    $A(t)$ is valid for ... $t$      iff $\#\{t \le n : A(t)\}$ ...
    almost all                       $\stackrel{+}{=} n$
    most                             $= n - o(n)$
    many                             $\stackrel{\times}{=} n$
    finitely many                    $\le c$  ($\exists c$)



(Semi)measures. We call $\rho: \mathcal{X}^* \to [0,1]$ a (semi)measure iff
$\sum_{x_n\in\mathcal{X}} \rho(x_{1:n}) \stackrel{(\le)}{=} \rho(x_{<n})$ and $\rho(\epsilon) \stackrel{(\le)}{=} 1$.
$\rho(x)$ is interpreted as the $\rho$-probability of sampling a sequence which
starts with $x$. In case of a semimeasure the gap $g_n = 1 - \sum_{x_{1:n}} \rho(x_{1:n}) \ge 0$ may be
interpreted as the possibility/probability of finite sequences of length less than $n$
[ZL70, Sch00], or as an evidence gap in Dempster-Shafer theory [Dem68, Sha76].
The conditional probability (posterior)

    $\rho(x_t|x_{<t}) := \rho(x_{1:t}) / \rho(x_{<t})$   (1)



^1 Environments of interest have reasonable complexity $K$, but $2^K$ is not of reasonable size.
is the $\rho$-probability that a string $x_1...x_{t-1}$ is followed by (continued with) $x_t$. We
call $\rho$ deterministic if $\exists\omega: \rho(\omega_{1:n}) = 1$ $\forall n$. In this case we identify $\rho$ with $\omega$.

Convergent predictors. We assume that $\mu$ is the "true"^2 sequence generating
measure, also called environment. If we know the generating process $\mu$, and given
past data $x_{<t}$, we can predict the probability $\mu(x_t|x_{<t})$ of the next data item $x_t$.
Usually we do not know $\mu$, but estimate it from $x_{<t}$. Let $\rho(x_t|x_{<t})$ be an estimated
probability^3 of $x_t$, given $x_{<t}$. Closeness of $\rho(x_t|x_{<t})$ to $\mu(x_t|x_{<t})$ is expected to lead
to "good" predictions:

Consider, for instance, a weather data sequence $x_{1:n}$ with $x_t = 1$ meaning rain and
$x_t = 0$ meaning sun at day $t$. Given $x_{<t}$ the probability of rain tomorrow is $\mu(1|x_{<t})$.
A weather forecaster may announce the probability of rain to be $y_t := \rho(1|x_{<t})$, which
should be close to the true probability $\mu(1|x_{<t})$. To aim for

    $\rho(x'_t|x_{<t}) \stackrel{(fast)}{\longrightarrow} \mu(x'_t|x_{<t})$ for $t \to \infty$   (2)

seems reasonable. A sequence of random variables $z_t = z_t(\omega)$ (like $z_t = \rho(x_t|x_{<t}) -
\mu(x_t|x_{<t})$) is said to converge to zero with $\mu$-probability 1 (w.p.1) if the set
$\{\omega : z_t(\omega) \to 0 \text{ for } t\to\infty\}$ has $\mu$-measure 1. $z_t$ is said to converge to zero in mean
sum (i.m.s.) if $\sum_{t=1}^\infty E[z_t^2] \le c < \infty$, where $E$ denotes $\mu$-expectation. Convergence
i.m.s. implies convergence w.p.1 (rapid if $c$ is of reasonable size).

Depending on the interpretation, a $\rho$ satisfying (2) could be called consistent
or self-tuning [KV86]. One problem with using (2) as performance measure is that
closeness cannot be computed, since $\mu$ is unknown. Another disadvantage is that (2)
does not take into account the value of correct predictions or the severity of wrong
predictions.

Self-optimizing predictors. More practical and flexible is a decision-theoretic
approach, where performance is measured w.r.t. the true outcome sequence $x_{1:n}$
by means of a loss function, for instance $\ell_{x_t y_t} := (x_t - y_t)^2$, which does not involve
$\mu$. More generally, let $\ell_{x_t y_t} \in [0,1] \subset \mathbb{R}$ be the received loss when performing some
prediction/decision/action $y_t \in \mathcal{Y}$ and $x_t \in \mathcal{X}$ is the $t$-th symbol of the sequence. Let
$y_t^\Lambda \in \mathcal{Y}$ be the prediction of a (causal) prediction scheme $\Lambda$. The true probability of
the next symbol being $x_t$, given $x_{<t}$, is $\mu(x_t|x_{<t})$. The $\mu$-expected loss (given $x_{<t}$)
when $\Lambda$ predicts the $t$-th symbol is

    $l_t^\Lambda(x_{<t}) := \sum_{x_t} \mu(x_t|x_{<t}) \ell_{x_t y_t^\Lambda}$.

The goal is to minimize the $\mu$-expected loss. More generally, we define the $\Lambda_\rho$
sequence prediction scheme

    $y_t^{\Lambda_\rho} := \arg\min_{y_t\in\mathcal{Y}} \sum_{x_t} \rho(x_t|x_{<t}) \ell_{x_t y_t}$,   (3)



^2 Also called objective or aleatory probability or chance.
^3 Also called subjective or belief or epistemic probability.
which minimizes the $\rho$-expected loss. If $\mu$ is known, $\Lambda_\mu$ is obviously the best
prediction scheme in the sense of achieving minimal expected loss ($l_t^{\Lambda_\mu} \le l_t^\Lambda$ for all $\Lambda$).
An important special case is the error loss $\ell_{xy} = 1 - \delta_{xy}$ with $\mathcal{Y} = \mathcal{X}$. In this case $\Lambda_\rho$
predicts the $y_t$ which maximizes $\rho(y_t|x_{<t})$, and $\sum_t E[l_t^{\Lambda_\rho}]$ is the expected number of
prediction errors (where $y_t^{\Lambda_\rho} \ne x_t$). The natural decision-theoretic counterpart of (2)
is to aim for

    $l_t^{\Lambda_\rho}(x_{<t}) \stackrel{(fast)}{\longrightarrow} l_t^{\Lambda_\mu}(x_{<t})$ for $t \to \infty$   (4)

which is called (without the fast supplement) self-optimization in control theory
[KV86].
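
The prediction scheme (3) can be made concrete with a small Python sketch. It assumes a finite alphabet, a posterior given as a dictionary, and a loss table `loss[x][y]`; all names and the numerical values of `mu` and `rho` are hypothetical and chosen only for illustration.

```python
def expected_loss(prob, loss, action):
    """Expected loss of `action` when the next symbol x has probability prob[x]."""
    return sum(p * loss[x][action] for x, p in prob.items())


def lambda_predict(prob, loss, actions):
    """Action with minimal prob-expected loss, in the spirit of (3)."""
    return min(actions, key=lambda y: expected_loss(prob, loss, y))


# Error loss on the binary alphabet: loss[x][y] = 0 if x == y else 1.
error_loss = {x: {y: int(x != y) for y in (0, 1)} for x in (0, 1)}

mu = {0: 0.7, 1: 0.3}    # hypothetical true posterior mu(.|x_<t)
rho = {0: 0.4, 1: 0.6}   # hypothetical estimated posterior rho(.|x_<t)

y_mu = lambda_predict(mu, error_loss, (0, 1))
y_rho = lambda_predict(rho, error_loss, (0, 1))
# The mu-expected loss of Lambda_rho can never fall below that of Lambda_mu.
regret = expected_loss(mu, error_loss, y_rho) - expected_loss(mu, error_loss, y_mu)
```

For the error loss the minimizer is simply the most probable symbol under the posterior used, so a poor estimate `rho` shows up directly as a non-negative regret.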



3 Predictive Properties of $M = 2^{-KM}$

We define a prefix/monotone Turing machine $T$ as a Turing machine with a binary
unidirectional input tape, a unidirectional output tape with alphabet $\mathcal{X}$, and some
bidirectional work tapes. We say $T$ halts on input $p$ with output $x$ and write
"$T(p) = x$ halts" if $p$ is to the left of the input head and $x$ is to the left of the output
head after $T$ halts. The set of $p$ on which $T$ halts forms a prefix code. We call
such codes $p$ self-delimiting programs. We write $T(p) = x*$ if $T$ outputs a string
starting with $x$; $T$ need not halt in this case. $p$ is called minimal if $T(q) \ne x*$ for
all proper prefixes $q$ of $p$. The set of all prefix/monotone Turing machines $\{T_1, T_2, ...\}$
can be effectively enumerated. There exists a universal prefix/monotone Turing
machine $U$ which can simulate every $T_i$. A function is called computable if there is
a Turing machine which computes it. A function is called enumerable if it can be
approximated from below. Let $\mathcal{M}^{msr}_{comp}$ be the set of all computable measures, $\mathcal{M}^{semi}_{enum}$
the set of all enumerable semimeasures, and $\mathcal{M}_{det}$ be the set of all deterministic
measures ($\hat{=} \mathcal{X}^\infty$).^4

Levin [ZL70, LV97] has shown the existence of an enumerable universal semimeasure
$M$ ($M \stackrel{\times}{\ge} \nu$ $\forall\nu \in \mathcal{M}^{semi}_{enum}$). An explicit expression due to Solomonoff [Sol64, Eq.(7)]
is

    $M(x) := \sum_{p:U(p)=x*} 2^{-\ell(p)}$,   $KM(x) := -\log M(x)$.   (5)

The sum is over all (possibly nonhalting) minimal programs $p$ which output a string
starting with $x$. This definition is equivalent to the probability that $U$ outputs a
string starting with $x$ if provided with fair coin flips on the input tape. $M$ can
be used to characterize randomness of individual sequences: A sequence $x_{1:\infty}$ is
(Martin-Löf) $\mu$-random, iff $\exists c : M(x_{1:n}) \le c\cdot\mu(x_{1:n})$ $\forall n$. For later comparison, we
summarize the (excellent) predictive properties of $M$ [Sol78, Hut01a, Hut03a, Hut04]
(the numbering will become clearer later):



^4 $\mathcal{M}^{semi}_{enum}$ is enumerable, but $\mathcal{M}^{msr}_{comp}$ is not, and $\mathcal{M}_{det}$ is uncountable.
Theorem 1 (Properties of $M = 2^{-KM}$) Solomonoff's prior $M$ defined in (5) is
a (i) universal, (v) enumerable, (ii) monotone, (iii) semimeasure, which (vi) converges
to $\mu$ i.m.s., and (vii) is self-optimizing i.m.s. More quantitatively:

(vi) $\sum_{t=1}^\infty E[\sum_{x'_t} (M(x'_t|x_{<t}) - \mu(x'_t|x_{<t}))^2] \stackrel{+}{\le} \ln 2 \cdot K(\mu)$, which implies
     $M(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$ i.m.s. for $\mu \in \mathcal{M}^{msr}_{comp}$.

(vii) $\sum_{t=1}^\infty E[(l_t^{\Lambda_M} - l_t^{\Lambda_\mu})^2] \stackrel{+}{\le} 2\ln 2 \cdot K(\mu)$, which implies
     $l_t^{\Lambda_M} \to l_t^{\Lambda_\mu}$ i.m.s. for $\mu \in \mathcal{M}^{msr}_{comp}$,

where $K(\mu)$ is the length of the shortest program computing function $\mu$.



4 Alternatives to Solomonoff's Prior $M$

The goal of this work is to investigate whether some other quantities that are closely
related to $M$ also lead to good predictors. The prefix Kolmogorov complexity $K$ is
closely related to $KM$ ($K(x) = KM(x) + O(\log \ell(x))$). $K(x)$ is defined as the length
of the shortest halting program on $U$ with output $x$:

    $K(x) := \min_p\{\ell(p) : U(p) = x \text{ halts}\}$,   $k(x) := 2^{-K(x)}$.   (6)

In Section 7 we briefly discuss that $K$ completely fails for predictive purposes. More
promising is to approximate $M(x) = \sum_{p:U(p)=x*} 2^{-\ell(p)}$ by the dominant contribution
in the sum, which is given by

    $m(x) := 2^{-Km(x)}$ with $Km(x) := \min_p\{\ell(p) : U(p) = x*\}$.   (7)

$Km$ is called monotone complexity and has been shown to be very close to $KM$
[Lev73a, Gac83] (see Theorem 6(o)). It is natural to call a sequence $x_{1:\infty}$ computable
if $Km(x_{1:\infty}) < \infty$. $KM$, $Km$, and $K$ are ordered in the following way:

    $0 \le K(x|\ell(x)) \stackrel{+}{\le} KM(x) \le Km(x) \le K(x) \stackrel{+}{\le} \ell(x)\cdot\log|\mathcal{X}| + 2\log\ell(x)$.   (8)

The second inequality follows from the fact that, given $n$ and Kraft's inequality
$\sum_{x\in\mathcal{X}^n} M(x) \le 1$, there exists for $x \in \mathcal{X}^n$ a Shannon-Fano code of length $-\log M(x)$,
which is effective since $M$ is enumerable. The other inequalities are obvious from
the definitions. There are many complexity measures (prefix, Solomonoff, monotone,
plain, process, extension, ...) which we generically denote by $\tilde K \in \{K, KM, Km, ...\}$
and their associated "predictive functions" $\tilde k(x) := 2^{-\tilde K(x)} \in \{k, M, m, ...\}$. This work
is mainly devoted to the study of $m$.

Note that $\tilde k$ is generally not a semimeasure, so we have to clarify what it means
to predict using $\tilde k$. One popular approach which is at the heart of the (one-part)
MDL principle is to predict the $y$ which minimizes $\tilde K(xy)$ (maximizes $\tilde k(xy)$), where
$x$ are past given data: $y_t^{MDL} := \arg\min_{y_t} \tilde K(x_{<t}y_t)$.
For complexity measures $\tilde K$, the conditional version $\tilde K_|(x|y)$ is often defined^5 as
$\tilde K(x)$, but where the underlying Turing machine $U$ has additionally access to $y$. The
definition $\tilde k_|(x|y) := 2^{-\tilde K_|(x|y)}$ for the conditional predictive function $\tilde k$ seems natural,
but has the disadvantage that the crucial chain rule is violated. For $\tilde K = K$ and
$\tilde K = Km$ and most other versions of $\tilde K$, the chain rule is still satisfied approximately
(to logarithmic accuracy), but this is not sufficient to prove convergence (2) or
self-optimization (4). Therefore, we define $\tilde k(x_t|x_{<t}) := \tilde k(x_{1:t})/\tilde k(x_{<t})$ in the following,
analogously to semimeasures $\rho$ (like $M$). A potential disadvantage of this definition
is that $\tilde k(x_t|x_{<t})$ is not enumerable, whereas $\tilde k_|(x_t|x_{<t})$ and $\tilde k(x_{1:t})$ are.

We can now embed MDL predictions minimizing $\tilde K$ into our general framework:
MDL coincides with the $\Lambda_{\tilde k}$ predictor for the error loss:

    $y_t^{\Lambda_{\tilde k}} = \arg\max_{y_t} \tilde k(y_t|x_{<t}) = \arg\max_{y_t} \tilde k(x_{<t}y_t) = \arg\min_{y_t} \tilde K(x_{<t}y_t) = y_t^{MDL}$   (9)

In the first equality we inserted $\ell_{xy} = 1 - \delta_{xy}$ into (3). In the second equality we
used the chain rule (1). In both steps we dropped some additive/multiplicative terms
that are independent of $y_t$ and hence ineffective in the argmax.
In the third equality we used $\tilde k = 2^{-\tilde K}$.
The last equality formalizes the one-part MDL principle: given $x_{<t}$ predict the $y_t \in \mathcal{X}$
which leads to the shortest code $p$. Hence, validity of (4) tells us something about
the validity of the MDL principle. (2) and (4) address what (good) prediction means.
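
The chain of equalities in (9) can be illustrated with a toy Python sketch. Since no complexity measure considered in this work is computable, the sketch swaps in the zlib-compressed length as a crude stand-in; this substitution, like all function names, is an assumption made purely for illustration and says nothing about the behaviour of $Km$ itself.

```python
import zlib
from fractions import Fraction


def toy_complexity(data: bytes) -> int:
    """Compressed length in bits; a crude computable stand-in (assumption)."""
    return 8 * len(zlib.compress(data, 9))


def mdl_predict(history: bytes, alphabet: bytes) -> int:
    """One-part MDL: the symbol y minimizing the complexity of history + y."""
    return min(alphabet, key=lambda y: toy_complexity(history + bytes([y])))


def lambda_predict_error_loss(history: bytes, alphabet: bytes) -> int:
    """Error-loss predictor: the symbol y maximizing 2^(-complexity(history + y))."""
    return max(
        alphabet,
        key=lambda y: Fraction(1, 2 ** toy_complexity(history + bytes([y]))),
    )
```

Both functions return the same symbol for every history, because $2^{-\tilde K}$ is strictly decreasing in $\tilde K$ and ties are broken in the same order. Exact fractions avoid floating-point underflow for long histories.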

5 General Predictive Functions 

We have seen that there are predictors (actually the major one studied in this work)
$\Lambda_\rho$, but where $\rho(x_t|x_{<t})$ is not (immediately) a semimeasure. Nothing prevents
us from replacing $\rho$ in (3) by an arbitrary function $b_| : \mathcal{X}^* \to [0,\infty)$, written as
$b_|(x_t|x_{<t})$. We also define general functions $b: \mathcal{X}^* \to [0,\infty)$, written as $b(x_{1:n})$ and
$b(x_t|x_{<t}) := b(x_{1:t})/b(x_{<t})$, which may not coincide with $b_|(x_t|x_{<t})$. Most terminology for
semimeasure $\rho$ can and will be carried over to the case of general predictive functions
$b$ and $b_|$, but one has to be careful which properties and interpretations still hold:

Definition 2 (Properties of predictive functions) We call functions $b, b_| :
\mathcal{X}^* \to [0,\infty)$ (conditional) predictive functions. They may possess some of the
following properties:

o) Proximity: $b(x)$ is "close" to the universal prior $M(x)$
i) Universality: $b \stackrel{\times}{\ge} \mathcal{M}$, i.e. $\forall\nu \in \mathcal{M}$ $\exists c > 0 : b(x) \ge c\cdot\nu(x)$ $\forall x$.
ii) Monotonicity: $b(x_{1:t}) \le b(x_{<t})$ $\forall t, x_{1:t}$
iii) Semimeasure: $\sum_{x_t} b(x_{1:t}) \le b(x_{<t})$ and $b(\epsilon) \le 1$
iv) Chain rule: $b(x_{1:t}) = b_.(x_t|x_{<t})\, b(x_{<t})$
v) Enumerability: $b$ is lower semicomputable
vi) Convergence: $b_.(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$ for $t\to\infty$, $\forall\mu \in \mathcal{M}$, $x'_t \in \mathcal{X}$, i.m.s. or w.p.1
vii) Self-optimization: $l_t^{\Lambda_{b_.}} \to l_t^{\Lambda_\mu}$ for $t\to\infty$, i.m.s. or w.p.1

where $b_.$ refers to $b$ or $b_|$.

^5 Usually written without index |.

The importance of the properties (i)-(iv) stems from the fact that they together
imply convergence (vi) and self-optimization (vii). Regarding proximity (o) we left
open what we mean by "close". We also did not specify $\mathcal{M}$ but have in mind all
computable measures $\mathcal{M}^{msr}_{comp}$ or enumerable semimeasures $\mathcal{M}^{semi}_{enum}$, possibly restricted to
deterministic environments $\mathcal{M}_{det}$.

Theorem 3 (Predictive relations)

a) (iii) $\Rightarrow$ (ii): A semimeasure is monotone.

b) (i),(iii),(iv) $\Rightarrow$ (vi): The posterior $b_.$ as defined by the chain rule (iv) of a
universal semimeasure $b$ converges to $\mu$ i.m.s. for all $\mu \in \mathcal{M}$.

c) (i),(iii),(v) $\Rightarrow$ (o): Every w.r.t. $\mathcal{M}^{semi}_{enum}$ universal enumerable semimeasure
coincides with $M$ within a multiplicative constant.

d) (vi) $\Rightarrow$ (vii): Posterior convergence i.m.s./w.p.1 implies self-optimization
i.m.s./w.p.1.

Proof sketch. (a) follows trivially from dropping the sum in (iii), (b) is
Solomonoff's major result [Sol78, LV97, Hut01a, Hut04], (c) is due to Levin [ZL70].
(d) follows from

    $0 \le l_t^{\Lambda_{b_.}} - l_t^{\Lambda_\mu} \le \sum_{x'_t} |b_.(x'_t|x_{<t}) - \mu(x'_t|x_{<t})|$,

since $\ell \in [0,1]$ [Hut03a, Thm.4(ii)]. □

We will see that (i),(iii),(iv) are crucial for proving (vi),(vii).

Normalization. Let us consider a scaled $b$ version $b_{norm}(x_t|x_{<t}) := c(x_{<t})\, b(x_t|x_{<t})$,
where $c > 0$ is independent of $x_t$. Such a scaling does not affect the prediction
scheme $\Lambda_b$ (3), i.e. $y_t^{\Lambda_b} = y_t^{\Lambda_{b_{norm}}}$, which implies $l_t^{\Lambda_{b_{norm}}} = l_t^{\Lambda_b}$. Convergence
$b(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$ implies $\sum_{x'_t} b(x'_t|x_{<t}) \to 1$ if $\mu$ is a measure, hence also
$b_{norm}(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$ for^6 $c(x_{<t}) := [\sum_{x'_t} b(x'_t|x_{<t})]^{-1}$. Speed of convergence
may be affected by normalization, either positively or negatively. Assuming the chain
rule (1) for $b_{norm}$ we get

    $b_{norm}(x_{1:n}) = \prod_{t=1}^n \frac{b(x_{1:t})}{\sum_{x_t} b(x_{1:t})} = d(x_{<n})\, b(x_{1:n})$,
    $d(x_{<n}) := \frac{1}{b(\epsilon)} \prod_{t=1}^n \frac{b(x_{<t})}{\sum_{x_t} b(x_{1:t})}$.

Whatever $b$ we start with, $b_{norm}$ is a measure, i.e. (iii) is satisfied with equality.
Convergence and self-optimization proofs are now eligible for $b_{norm}$, provided
universality (i) can be proven for $b_{norm}$. If $b$ is a semimeasure, then $d \ge 1$, hence

^6 Arbitrarily we define $b_{norm}(x_t|x_{<t}) = 1/|\mathcal{X}|$ if $\sum_{x'_t} b(x'_t|x_{<t}) = 0$.
$M_{norm} \ge M \stackrel{\times}{\ge} \mathcal{M}^{semi}_{enum}$ is universal and converges (vi) with the same bound (Theorem
1(vi)) as for $M$. On the other hand, $d(x_{<n})$ may be unbounded for $b = k$ and $b = m$,
so normalization does not help us in these cases for proving (vi). Normalization
transforms a universal non-semimeasure into a measure, which may no longer be
universal.
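
The normalization can be tried out on a toy predictive function. In the Python sketch below, `b` is a hypothetical monotone function that is not a semimeasure (it assigns $2^{-c}$ to a string with $c$ symbol changes); it is an assumption chosen for illustration and is not the function $m$ of this paper. The sketch computes $b_{norm}$ via the chain rule and the normalizer $d$.

```python
from fractions import Fraction

ALPHABET = "01"


def b(x):
    """Toy predictive function: 2^(-number of symbol changes in x)."""
    changes = sum(1 for u, v in zip(x, x[1:]) if u != v)
    return Fraction(1, 2 ** changes)


def b_norm(x):
    """Product over t of b(x_{1:t}) / sum_{x_t} b(x_{1:t})."""
    result = Fraction(1)
    for t in range(1, len(x) + 1):
        past = x[:t - 1]
        result *= b(x[:t]) / sum(b(past + a) for a in ALPHABET)
    return result


def d(x_before):
    """Normalizer d(x_{<n}), where x_before plays the role of x_{<n}."""
    result = 1 / b("")
    for t in range(1, len(x_before) + 2):
        past = x_before[:t - 1]
        result *= b(past) / sum(b(past + a) for a in ALPHABET)
    return result
```

For this toy `b` the normalizer stays below 1, which reflects that `b` sums to more than a semimeasure would allow; exact fractions make the measure property of `b_norm` checkable without rounding.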

Universal Non-Semimeasures. If $b \stackrel{\times}{\ge} M$ is a universal semimeasure, then $b$ is as
good for prediction as $M$. The bounds are loosened by at most an additive constant.
For $b$ still dominating $M$, but no longer being a semimeasure, we believe that (vi)
and (vii) can be violated. Bounds can be shown without any further assumptions
on $b$ on-sequence and, if we demand a lower and upper bound on $b$, i.e. $b \stackrel{\times}{=} M$, then
also off-sequence:

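Theorem 4 (Convergence of Universal Non-Semimeasures) For every predictive
function $b$, and real numbers $a$ and $c$ it holds:

a) $\sum_{t=1}^n [1 - b(x_t|x_{<t})] \le \ln 2 \cdot KM(x_{1:n}) + \ln a^{-1}$ if $aM(x) \le b(x)$ $\forall x$,
b) $\sum_{t=1}^n \sum_{\bar x_t \ne x_t} b(\bar x_t|x_{<t}) \le \frac{c}{a} \ln 2 \cdot KM(x_{1:n})$ if $aM(x) \le b(x) \le cM(x)$ $\forall x$.

For computable $x_{1:\infty}$ this implies: $b(\bar x_t|x_{<t}) \to 0$ and $b_{norm}(\bar x_t|x_{<t}) \to 0$ for $\bar x_t \ne x_t$,
and $b(x_t|x_{<t}) \to 1$ if $b(x_t|x_{<t}) \le 1$, and $b_{norm}(x_t|x_{<t}) \to 1$ for $t \to \infty$.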



Remarks. If $b$ additionally is a semimeasure, i.e. $\sum_{\bar x_t \ne x_t} b(\bar x_t|x_{<t}) \le 1 - b(x_t|x_{<t})$,
then (a) implies an improved off-sequence bound. Note that $b(\bar x_t|x_{<t}) \to 0$ does not
imply $b(x_t|x_{<t}) \to 1$. Furthermore, although $b_{norm}$ is a measure, convergence cannot
be concluded similarly to Theorem 1(vi), since $b_{norm}$ may not be universal due to a possibly
unbounded normalizer $d(x_{<t})$.

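Proof. (a)

    $\sum_{t=1}^n [1 - b(x_t|x_{<t})] \le \sum_{t=1}^n \ln b(x_t|x_{<t})^{-1} = \ln b(x_{1:n})^{-1}$
        $\le \ln[aM(x_{1:n})]^{-1} = \ln 2 \cdot KM(x_{1:n}) + \ln a^{-1}$.

(b)

    $b(\bar x_t|x_{<t}) = \frac{b(x_{<t}\bar x_t)}{b(x_{<t})} \le \frac{cM(x_{<t}\bar x_t)}{aM(x_{<t})} = \frac{c}{a} M(\bar x_t|x_{<t})$.

For every semimeasure $\rho$ it holds:

    $\sum_{t=1}^n \sum_{\bar x_t \ne x_t} \rho(\bar x_t|x_{<t}) \le \sum_{t=1}^n [1 - \rho(x_t|x_{<t})] \le -\sum_{t=1}^n \ln\rho(x_t|x_{<t}) = -\ln\rho(x_{1:n})$.

Combining both bounds and using that $M$ is a semimeasure we get

    $\sum_{t=1}^n \sum_{\bar x_t \ne x_t} b(\bar x_t|x_{<t}) \le \frac{c}{a} \sum_{t=1}^n \sum_{\bar x_t \ne x_t} M(\bar x_t|x_{<t}) \le \frac{c}{a} \ln 2 \cdot KM(x_{1:n})$.
□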
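
The elementary chain of inequalities used for semimeasures in the proof can be checked numerically. The Python sketch below uses Laplace's rule as a hypothetical stand-in for $\rho$ (an assumption, since $M$ itself is not computable) and evaluates the three sums along a given binary sequence.

```python
import math

ALPHABET = "01"


def laplace(symbol, history):
    """Laplace's rule: a computable measure used as toy rho(symbol|history)."""
    return (history.count(symbol) + 1) / (len(history) + len(ALPHABET))


def chain(x):
    """Return (off-sequence sum, on-sequence gap sum, -ln rho(x))."""
    off = on_gap = neg_log = 0.0
    for t in range(len(x)):
        past, actual = x[:t], x[t]
        off += sum(laplace(a, past) for a in ALPHABET if a != actual)
        on_gap += 1.0 - laplace(actual, past)
        neg_log += -math.log(laplace(actual, past))
    return off, on_gap, neg_log
```

Because Laplace's rule is a proper measure, the first two sums coincide up to rounding; the second inequality is the pointwise bound $1 - z \le -\ln z$.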
6 Predictive Properties of $m = 2^{-Km}$

We can now state which predictive properties of $m$ hold, and which do not. We first
summarize the qualitative predictive properties of $m$ in Corollary 5, and subsequently
present detailed quantitative results in Theorems 6(o)-(vii), followed by
an item-by-item explanation, discussion and detailed proofs.



Corollary 5 (Properties of $m = 2^{-Km}$) For $b = m = 2^{-Km}$, where $Km$ is the
monotone Kolmogorov complexity (7), the following properties of Definition 2 are
satisfied/violated: (o) For every $\mu \in \mathcal{M}^{msr}_{comp}$ and every $\mu$-random sequence $x_{1:\infty}$,
$m(x_{1:n})$ equals $M(x_{1:n})$ within a multiplicative constant. $m$ is (i) universal (w.r.t.
$\mathcal{M} = \mathcal{M}^{msr}_{comp}$), (ii) monotone, and (v) enumerable, but is ¬(iii) not a semimeasure.
$m$ satisfies (iv) the chain rule by definition for $m_. = m$, but for $m_. = m_|$ the chain rule
is only satisfied to logarithmic order. For $m_. = m$, $m$ (vi) converges and (vii) is
self-optimizing for deterministic $\mu \in \mathcal{M}^{msr}_{comp} \cap \mathcal{M}_{det}$, but in general not for probabilistic
$\mu \in \mathcal{M}^{msr}_{comp} \setminus \mathcal{M}_{det}$.

The lesson to learn is that although $m$ is very close to $M$ in the sense of (o) and $m$
dominates all computable measures $\mu$, predictions based on $m$ may nevertheless fail
(cf. Theorem 1).

Some proof ideas, (o) |ZL70t Thm.3.4] and |Lev73aj . (i) |Lev73aj . (ii) from 
Km{xy) > Km{x) (see definition of Km). -i{iii) follows from (i) ,{iv) ,-i{vi) and 
TheoremEt with m\ : = m. (iv) follows within log from Km = K-\-0{\og) and jLV97t 
Thm.3.9.1], -'{iv), since it does not even hold within an additive constant, (v) 
immediate from definition, {vi) similarly as for M. -i{vi) Use m\ g2~^° and define 
a /i| ^2"^". (vii) follows from (vi). -i{vii) For the monotone Turing machine U 
defined by U{lxO) = xO, the loss £oo = hi = 0, iw = 1, ioi = | and a Bernoulli(i) 
process /i(xt|x<t) = | one can show yf^™ = 07^ 1 = ?/ff , which implies /f™ = | > | = /f''. 
Extending U to a universal Turing machine by U{0^^^p) = U'{p) leaves this result 
intact with probability >1 — 2~*, since random strings cannot be compressed (by 
U'). □ 



6.0 Proximity of $m = 2^{-Km}$

The following closeness/separation results between Km and KM are known: 

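Theorem 6 (o) (Proximity of $m = 2^{-Km}$)

(1) $\forall\mu \in \mathcal{M}^{msr}_{comp}$ $\forall\mu$-random $\omega$ $\exists c_\omega : Km(\omega_{1:n}) \le KM(\omega_{1:n}) + c_\omega$ $\forall n$, [Lev73a]

(2) $KM(x) \le Km(x) \le KM(x) + 2\log Km(x) + O(1)$ $\forall x$. [ZL70, Thm.3.4]
¬(3) $\forall c : Km(x) - KM(x) \ge c$ for infinitely many $x$. [Gac83]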
Remarks. The first line (o1) shows that $m$ is close to $M$ within a multiplicative
constant for nearly all strings in a very strong sense. $\sup_n \frac{M(\omega_{1:n})}{m(\omega_{1:n})} \le 2^{c_\omega}$ is finite for
every $\omega$ which is random (in the sense of Martin-Löf) w.r.t. any computable $\mu$, but
note that the constant $c_\omega$ depends on $\omega$. Levin falsely conjectured the result to be
true for all $\omega$, but could only prove it to hold within logarithmic accuracy (o2). A
later result by Gács ¬(o3), indeed, shows that $Km - KM$ is unbounded (for infinite
alphabet it can even increase logarithmically).



Proof. The first two properties are due to Levin and are proven in [Lev73a] and
[ZL70, Thm.3.4], respectively. The third property follows easily from Gács' result
[Gac83], which says that if $g$ is some monotone co-enumerable function for which
$Km(x) - KM(x) \le g(\ell(x))$ holds for all $x$, then $g(n)$ must be $\stackrel{+}{\ge} K(n)$. Assume
$Km(x) - KM(x) \ge \log\ell(x)$ only for finitely many $x$. Then there exists a $c$ such that
$Km(x) - KM(x) \le \log\ell(x) + c$ for all $x$. Gács' theorem now implies $\log n + c \stackrel{+}{\ge} K(n)$ $\forall n$,
which is wrong due to Kraft's inequality $\sum_n 2^{-K(n)} \le 1$. □

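6.1 Universality of $m = 2^{-Km}$

Theorem 6 (i) (Universality of $m = 2^{-Km}$)

(1) $Km(x) \stackrel{+}{\le} -\log\mu(x) + K(\mu)$ if $\mu \in \mathcal{M}^{msr}_{comp}$, [LV97, Thm.4.5.4]

(2) $m \stackrel{\times}{\ge} \mathcal{M}^{msr}_{comp}$, but $m \stackrel{\times}{\not\ge} \mathcal{M}^{semi}_{enum}$ (unlike $M \stackrel{\times}{\ge} \mathcal{M}^{semi}_{enum}$).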

Remarks. The first line (i1) can be interpreted as a "continuous" coding theorem
for $Km$ and recursive $\mu$. It implies (by exponentiation) that $m$ dominates all
computable measures (i2). Unlike $M$ it does not dominate all enumerable semimeasures.
Dominance is a key feature for good predictors. From a practical point of
view the assumption that the true generating distribution $\mu$ is a proper measure
and computable seems not to be restrictive. The problem will be that $m$ is not a
semimeasure.



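Proof. The first line is proven in [LV97, Thm.4.5.4]. Exponentiating this result
gives $m(x) \ge c_\mu \mu(x)$ $\forall x, \mu \in \mathcal{M}^{msr}_{comp}$, i.e. $m \stackrel{\times}{\ge} \mathcal{M}^{msr}_{comp}$. Exponentiation of ¬(o3) implies
$m(x) \stackrel{\times}{\not\ge} M(x) \in \mathcal{M}^{semi}_{enum}$, i.e. $m \stackrel{\times}{\not\ge} \mathcal{M}^{semi}_{enum}$. □
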
6.2 Monotonicity of $m = 2^{-Km}$

Monotonicity of $Km$ is obvious from the definition of $Km$ and is the origin of calling
$Km$ monotone complexity:

Theorem 6 (ii) (Monotonicity of $m = 2^{-Km}$)

    $Km(xy) \ge Km(x) \in \mathbb{N}_0$,   $0 < m(xy) \le m(x) \in 2^{-\mathbb{N}_0} \le 1 = m(\epsilon)$.
6.3 Non-Semimeasure Property of $m = 2^{-Km}$

While m is monotone, it is not a semimeasure. The following theorem shows and 
quantifies how the crucial semimeasure property is violated for m in an essential 
way. 

Theorem 6 (iii) (Non-Semimeasure property of $m = 2^{-Km}$)

¬(1) If $x_{1:\infty}$ is computable, then $\sum_{x_t} m(x_{1:t}) \not\le m(x_{<t})$ for almost all $t$,
¬(2) If $Km(x_{1:t}) = o(t)$, then $\sum_{x_t} m(x_{1:t}) \not\le m(x_{<t})$ for most $t$.

Remark. On the other hand, at least for computable environments, multiplying
Theorem 6(vi1) and (vi3) by $m(x_{<t})$ shows that asymptotically the violation gets small, i.e.
$\sum_{x_t} m(x_{1:t}) \to m(x_{<t})$ for computable $x_{1:\infty}$.

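Proof. Simple violation of the semimeasure property can be inferred indirectly
from $m$ possessing properties (i),(iv),¬(vi) (see Definition 2) and Theorem 3b. To
prove ¬(iii1) we first note that $Km(x) < \infty$ for all finite strings $x \in \mathcal{X}^*$, which implies
$m(x_{1:n}) > 0$. Hence, whenever $Km(x_{1:n}) = Km(x_{<n})$, we have $\sum_{x_n} m(x_{1:n}) > m(x_{1:n}) =
m(x_{<n})$, a violation of the semimeasure property. For computable $x_{1:\infty}$, $Km(x_{1:n})$ is
integer-valued, non-decreasing in $n$ by (ii), and bounded by $Km(x_{1:\infty}) < \infty$, so it changes
only finitely often and the violation occurs for almost all $t$. ¬(iii2) now follows from

    $\#\{t \le n : \sum_{x_t} m(x_{1:t}) \le m(x_{<t})\} \le \#\{t \le n : Km(x_{1:t}) \ne Km(x_{<t})\}$
        $\le \sum_{t=1}^n [Km(x_{1:t}) - Km(x_{<t})] = Km(x_{1:n})$,

where we exploited (ii) in the last inequality. □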

6.4 Chain Rule for $m = 2^{-Km}$

Theorem 6 (iv) (Chain rule for $m = 2^{-Km}$)

(1) $0 < m(x|y) := \frac{m(yx)}{m(y)} \le 1$.
¬(2) If $m_|(x|y) := 2^{-\min_p\{\ell(p) : U(p,y) = x*\}}$, then $\exists x,y : m(yx) \ne m_|(x|y)\cdot m(y)$.
¬(3) $Km(yx) = Km_|(x|y) + Km(y) \pm O(\log\ell(xy))$.

Remarks. Line 1 shows that the chain rule can be satisfied by definition. With such
a definition, $m(x|y)$ is strictly positive like $M(x|y)$, but not necessarily strictly less
than 1, unlike $M(x|y)$. Nevertheless it is bounded by 1 due to monotonicity of $m$,
unlike for $k$ (see Theorem 7). If a conditional monotone complexity $Km_| = -\log m_|$
is defined similarly to the conditional Kolmogorov complexity $K_|$, then the chain
rule is only valid within logarithmic accuracy (lines 2 and 3).

Proof. (iv1) is immediate from (ii). ¬(iv2) follows from the fact that equality does
not even hold within an additive constant, i.e. $Km(yx) \stackrel{+}{\ne} Km_|(x|y) + Km(y)$. The
proof of the latter is similar to the one for $K$ (see [LV97]). ¬(iv3) follows within log
from $Km = K + O(\log)$ and Theorem 7(iv). □
6.5 Enumerability of $m = 2^{-Km}$

$m$ shares the obvious enumerability property with $M$, and $Km$ shares the obvious
co-enumerability property with $K$:

Theorem 6 (v) (Enumerability of $m = 2^{-Km}$)

(1) m is enumerable, i.e. lower semicomputable. 

(2) Km is co-enumerable, i.e. upper semicomputable. 

6.6 Convergence of $m = 2^{-Km}$

Theorem 6 (vi) (Convergence of $m = 2^{-Km}$)

(1) $\sum_{t=1}^n |1 - m(x_t|x_{<t})| \le \frac12 Km(x_{1:n})$,   $m(x_t|x_{<t}) \stackrel{fast}{\longrightarrow} 1$ for comp. $x_{1:\infty}$.

(2) Indeed, $m(x_t|x_{<t}) \ne 1$ at most $Km(x_{1:\infty})$ times.

(3) Y:i=Jlxt^xM^t\x<t)<2^'^^^^-^\ m{xt\x<tY^^ for comp. x^.,^. 

(4) Y!l=iY.xt^xt'frt{xt\x<t)<[Km{xi.,n)f, m{xt\x<:t)^-^ for comp. xi:oo. 
-(5) ys3U,xi;oc ■ Km{xi;oo)=s andYl'^iZxt^xt^{xt\x<t)>'i^ -'i- 

-(6) BfieM'^ZpX-^det ■ m(^norm)iXt\x<t)^-^IJ'iXt\x<t)yXi:oc, 

Remarks. Line 1 shows that the on-sequence predictive properties of $m$ for deterministic computable environments are excellent. The predicted $m$-probability^7 of $x_t$ given $x_{<t}$ converges rapidly to 1 for reasonably simple $x_{1:\infty}$. A similar result holds for $M$.

The stronger result (second line), that $m(x_t|x_{<t})$ deviates from 1 at most $Km(x_{1:\infty})$ times, does not hold for $M$.

Note that without constraint on the predictive function $b$, perfect on-sequence prediction could trivially be achieved by defining $b(x'_t|x_{<t}) = 1\ \forall x'_t$, which correctly predicts $x_t$ with "probability" 1. But since we do not know the true outcome $x_t$ in advance, we need to predict the probability of $x'_t$ well for all $x'_t\in\mathcal X$. $m(\cdot|\cdot)$ also converges off-sequence for $\bar x_t\ne x_t$ (to zero, as it should), but the bound (third line) is much weaker than the on-sequence bound (first line), so rapid convergence cannot be concluded. This is unlike $M$, where rapid convergence $M(x_t|x_{<t})\to 1$ implies rapid convergence $M(\bar x_t|x_{<t})\to 0$, since $\sum_{x'_t}M(x'_t|x_{<t})\le 1$. Consider an environment $x_{1:\infty}$ describable in 500 bits: bound $(vi_3)$ does not exclude $m(\bar x_t|x_{<t})$ from being 1 (maximally wrong) for all $t = 1..2^{500}$, with asymptotic convergence being of purely academic interest.

Line 4 presents a bound polynomial in $Km$, which is theoretically better than the exponential bound of line 3, but there is a pitfall due to the hidden multiplicative constant.

^7 We say "probability" just for convenience, not forgetting that $m(\cdot|x_{<t})$ is not a proper (semi)probability distribution.
Line 5 shows that for particular universal Turing machines this constant can be exponentially large. Note that this does not contradict the polynomial bound, since the multiplicative constant $2^{c_U}$ is allowed to depend on $U$. For a reasonable Turing machine, the compiler constant $c_U$ is of reasonable size, but $2^{c_U}$ is unreasonably large. Let $U'$ be a Turing machine which you regard as reasonable. Then, for e.g. $s = 64 = O(1)$, the $U$ constructed in the proof is as reasonable as $U'$ in the sense that a program of $U'$ needs only to be prefixed by a short 64 bit word to run on $U$ (the compiler constant between $U$ and $U'$ is small). In this sense, there are reasonable Turing machines $U$ for which $m$ makes the unreasonably large number of $2^{64}-2$ prediction errors on the trivial sequence $0_{1:\infty}$, as we will show.

Line 6 shows that the situation is provably worse in the probabilistic case. There are computable measures $\mu$ for which neither $m(x_t|x_{<t})$ nor $m_{norm}(x_t|x_{<t})$ converge to $\mu(x_t|x_{<t})$ for any $x_{1:\infty}$. So while [VL00, Thm.11] and [LV97, Thm.5.2.3], stating that $\mu(x_{t:t+l}|x_{<t}) \stackrel{\times}{=} m(x_{t:t+l}|x_{<t})$ for $\mu$-random $x_{1:\infty}$ and fixed $l$, are correct, the conclusion [VL00, Cor.2] and [LV97, Cor.5.2.2] that ($m$ is good for prediction in the sense that) maximizing $\mu(\cdot|x_{<t})$ is asymptotically equivalent to maximizing $m(\cdot|x_{<t})$ is wrong. For this to be true we would need convergence without multiplicative fudge, and which also holds off-sequence, i.e. $m_{(norm)}(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$, which $\neg(vi_6)$ shows to fail (even on-sequence).

Proof. $(vi_{1\&2})$

$$\#\{t\le n: m(x_t|x_{<t})\ne 1\}\ \le\ \sum_{t=1}^{n}2\,|1-m(x_t|x_{<t})|\ \le\ -\sum_{t=1}^{n}\log m(x_t|x_{<t})\ =\ -\log m(x_{1:n})\ =\ Km(x_{1:n}).$$

In the first inequality we used $m := m(x_t|x_{<t}) \in 2^{-\mathbb N_0}$, hence $1\le 2|1-m|$ for $m\ne 1$. In the second inequality we used $1-m\le -\frac12\log m$, valid for $m\in[0,\frac12]\cup\{1\}\supset 2^{-\mathbb N_0}$. In the first equality we used (the log of) the chain rule $n$ times. For computable $x_{1:\infty}$ we have $\sum_{t=1}^{\infty}|1-m(x_t|x_{<t})| \le \frac12 Km(x_{1:\infty}) < \infty$, which implies $m(x_t|x_{<t})\to 1$ (fast if $Km(x_{1:\infty})$ is of reasonable size). This shows the first two lines of $(vi)$.
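The two elementary inequalities and the resulting chain of bounds can be checked numerically. The following is a minimal sketch, not part of the original proof: the list `km` is a hypothetical, made-up sequence standing in for the values $Km(x_{1:t})$ (which are not computable in general), and both function names are illustrative.

```python
def check_elementary_inequalities(max_k=60):
    """Check both inequalities used in the proof for m = 2^-k, k = 0..max_k."""
    for k in range(max_k + 1):
        m = 2.0 ** (-k)
        if k > 0:
            # first inequality: 1 <= 2|1 - m| whenever m != 1
            assert 1.0 <= 2.0 * abs(1.0 - m)
        # second inequality: 1 - m <= -(1/2) log2(m), where log2(2^-k) = -k
        assert 1.0 - m <= 0.5 * k
    return True


def on_sequence_chain(km):
    """km[t] is a hypothetical value of Km(x_{1:t}); km[0] = 0 for the empty string."""
    # chain rule: m(x_t | x_<t) = 2^-(Km(x_{1:t}) - Km(x_<t))
    posteriors = [2.0 ** (-(km[t] - km[t - 1])) for t in range(1, len(km))]
    deviations = sum(1 for p in posteriors if p != 1.0)
    doubled_gap = sum(2.0 * abs(1.0 - p) for p in posteriors)
    return deviations, doubled_gap, km[-1]


print(check_elementary_inequalities())
print(on_sequence_chain([0, 3, 3, 4, 4, 4, 7, 7]))
```

For the made-up sequence above, the three returned numbers are the count of steps with $m(x_t|x_{<t})\ne 1$, the sum $\sum_t 2|1-m(x_t|x_{<t})|$, and $Km(x_{1:n})$; they appear in nondecreasing order, as the displayed chain requires.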

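$(vi_3)$ Fix a sequence $x_{1:\infty}$ and define $Q := \{x_{<t}\bar x_t : t\in\mathbb N,\ \bar x_t\ne x_t\}$. $Q$ is a prefix-free set of finite strings. For any such $Q$ and any semimeasure $\rho$, one can show that $\sum_{x\in Q}\rho(x)\le 1$.^8 Since $M$ is a semimeasure lower-bounded by $m$ we get

$$\sum_{t=1}^{n}\sum_{\bar x_t\ne x_t} m(x_{<t}\bar x_t)\ \le\ \sum_{t=1}^{\infty}\sum_{\bar x_t\ne x_t} m(x_{<t}\bar x_t)\ =\ \sum_{x\in Q} m(x)\ \le\ \sum_{x\in Q} M(x)\ \le\ 1.$$

With this, and using monotonicity of $m$, we get

$$\sum_{t=1}^{n}\sum_{\bar x_t\ne x_t} m(\bar x_t|x_{<t})\ =\ \sum_{t=1}^{n}\sum_{\bar x_t\ne x_t}\frac{m(x_{<t}\bar x_t)}{m(x_{<t})}\ \le\ \sum_{t=1}^{n}\sum_{\bar x_t\ne x_t}\frac{m(x_{<t}\bar x_t)}{m(x_{1:n})}\ \le\ \frac{1}{m(x_{1:n})}\ =\ 2^{Km(x_{1:n})}.$$

Finally, for an infinite sum to be finite, its elements must converge to zero.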
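The key fact that a prefix-free set carries total weight at most 1 can be illustrated on the off-sequence set $Q$. The sketch below is an illustration only: it uses the uniform measure $\lambda(x)=|\mathcal X|^{-\ell(x)}$ as a stand-in for a generic semimeasure $\rho$ (an assumption made here, since $m$ and $M$ are not computable), and the function names are hypothetical.

```python
from fractions import Fraction


def off_sequence_set(x, alphabet="01"):
    """Q = { x_<t a : a != x_t }, for t = 1..len(x)."""
    return [x[:t] + a for t in range(len(x)) for a in alphabet if a != x[t]]


def is_prefix_free(strings):
    """True if no string in the list is a proper prefix of another one."""
    return not any(u != v and v.startswith(u) for u in strings for v in strings)


def uniform_measure(z, alphabet_size=2):
    """lambda(z) = |X|^-len(z), a simple stand-in for a semimeasure rho."""
    return Fraction(1, alphabet_size ** len(z))


Q = off_sequence_set("0110100110")
print(is_prefix_free(Q), sum(uniform_measure(z) for z in Q))
```

For a binary sequence of length $n$ the total weight under $\lambda$ is $\sum_{t=1}^{n}2^{-t} = 1-2^{-n}$, which stays below 1 for every $n$.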

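^8 This follows from $1\ge\rho(A\cup B)\ge\rho(A)+\rho(B)$ if $A\cap B=\{\}$, and $\Gamma_x\cap\Gamma_y=\{\}$ if $x$ is not a prefix of $y$ and $y$ is not a prefix of $x$, where $\Gamma_x := \{\omega : \omega_{1:\ell(x)} = x\}$, hence $\sum_{x\in Q}\rho(\Gamma_x)\le\rho(\bigcup_{x\in Q}\Gamma_x)\le 1$, and noting that $\rho(x)$ is actually an abbreviation for $\rho(\Gamma_x)$.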
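$(vi_4)$ For $t\le n$ we can bound

$$m(\bar x_t|x_{<t})\ =\ \frac{m(x_{<t}\bar x_t)}{m(x_{<t})}\ \stackrel{\times}{\le}\ Km^2(x_{<t})\,\frac{M(x_{<t}\bar x_t)}{M(x_{<t})}\ \le\ Km^2(x_{1:n})\,M(\bar x_t|x_{<t}).$$

In the first inequality we exploited Theorem 6 $(o_2)$ in the exponentiated form $M(x)/Km^2(x) \stackrel{\times}{\le} m(x) \le M(x)$. In the last inequality we used monotonicity of $m$. Using the theorem on the convergence of universal non-semimeasures with $a=c=1$ and $b=M$, and $KM\le Km$, we get

$$\sum_{t=1}^{n}\sum_{\bar x_t\ne x_t} m(\bar x_t|x_{<t})\ \stackrel{\times}{\le}\ Km^2(x_{1:n})\sum_{t=1}^{n}\sum_{\bar x_t\ne x_t} M(\bar x_t|x_{<t})\ \le\ \ln 2\cdot Km^3(x_{1:n}).$$

Note that using $(o_1)$ instead of $(o_2)$ leads to a bound $2^{c_\omega}\ln 2\cdot Km(\omega)$, which for computable $\omega$ is also finite, but of unspecified magnitude due to the factor $2^{c_\omega}$.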

$\neg(vi_5)$ Fix $s\in\mathbb N$ and let $t\in T := \{1,...,2^s-2\}$. We define a universal monotone Turing machine $U$ by $U(0^s) = 0^\infty$ and $U(q) = 0^{t-1}1$ for $q\in\{0,1\}^s\setminus\{0^s,1^s\}$, where $t\in T$ is the natural number represented by the $s$-bit string $q$ (any coding will do). Only for the purpose of making $U$ universal, we define $U(1^s p) = U'(p)$ for $p\in\{0,1\}^*$ and $U'$ being some (other, e.g. your favorite) universal Turing machine. Obviously the length of the shortest programs on $U$ for $0_{1:\infty}$, $0_{<t}1$ and $0_{<t}$ is $s$, i.e. $Km(0_{1:\infty}) = Km(0_{<t}) = Km(0_{<t}1) = s$, which implies $m(1|0_{<t}) = 1$. So for $x_{1:\infty} = 0_{1:\infty}$, we have

$$\sum_{t=1}^{\infty}\sum_{\bar x_t\ne x_t} m(\bar x_t|x_{<t})\ \ge\ \sum_{t=1}^{2^s-2} m(1|0_{<t})\ =\ 2^s-2,$$

which proves $\neg(vi_5)$. Note that $m_{norm}(1|0_{<t}) \ge \frac{1}{|\mathcal X|}$, i.e. save a factor of $|\mathcal X|$ the same lower bound holds for $m_{norm}$. Note also that on-sequence prediction is perfect, since $m(0|0_{<t}) = 1\ \forall t\in\mathbb N$.

Remark. It is instructive to see why $M(\bar x_t|x_{<t})$ converges fast to 0 for this $U$: The single program of size $s$ for $0_{<t}1$ is outweighed by the $2^s-t$ programs of size $s$ for $0_{<t}$. Ignoring the contributions from $U'$, we have $M(1|0_{<t}) \approx \frac{2^{-s}}{(2^s-t)\cdot 2^{-s}} = \frac{1}{2^s-t}$, hence $\sum_{t=1}^{2^s-2} M(1|0_{<t}) \approx s\cdot\ln 2$.
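The contrast between $m$ and $M$ on this machine can be reproduced with a small simulation. This is a sketch under explicit assumptions: the universal branch through $U'$ is ignored entirely (as in the remark above), each remaining $s$-bit program is summarised only by the position of the first 1 in its output, and the function name `toy_machine_sums` is illustrative.

```python
import math


def toy_machine_sums(s):
    """Off-sequence sums of m(1|0_<t) and M(1|0_<t) over t in T, ignoring U'."""
    ts = range(1, 2 ** s - 1)        # T = {1, ..., 2^s - 2}
    # None stands for program 0^s (prints 0^infinity); the program coding t
    # prints t-1 zeros followed by a 1, so its first 1 is at position t.
    first_one = [None] + list(ts)
    m_sum, M_sum = 0.0, 0.0
    for t in ts:
        # number of length-s programs whose output starts with 0_<t
        n_prefix = sum(1 for f in first_one if f is None or f >= t)
        # number of length-s programs whose output starts with 0_<t 1
        n_wrong = sum(1 for f in first_one if f == t)
        # both shortest programs have length s, so m(1|0_<t) = 2^-(s - s) = 1
        m_sum += 1.0 if n_wrong else 0.0
        # M adds up all programs of length s: 2^-s / ((2^s - t) * 2^-s)
        M_sum += n_wrong / n_prefix
    return m_sum, M_sum


s = 8
m_sum, M_sum = toy_machine_sums(s)
print(m_sum, M_sum, s * math.log(2))
```

In this toy setting the $m$-sum equals $2^s-2$ exactly, while the $M$-sum is a harmonic sum of order $s\ln 2$; the two differ by a bounded additive term, consistent with the approximate statement in the remark.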

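$\neg(vi_6)$ We show that the range of $m_{(norm)}$ is not dense in $[0,1]$ and then choose a $\mu$ not in the closure of the range. For binary alphabet $\mathcal X=\{0,1\}$, the proof is particularly simple: We choose $\mu(1|x_{<t}) = \frac38$, hence $\mu(0|x_{<t}) = \frac58$. Since $m(x_t|x_{<t}) \in 2^{-\mathbb N_0} = \{1,\frac12,\frac14,\frac18,...\}$, we have $|m(x_t|x_{<t})-\mu(x_t|x_{<t})| \ge \frac18\ \forall t,\ \forall x_{1:\infty}$. Similarly for

$$m_{norm}(x_t|x_{<t})\ =\ \frac{m(x_t|x_{<t})}{m(0|x_{<t})+m(1|x_{<t})}\ \in\ \Big\{\frac{2^{-n}}{2^{-n}+2^{-m}} : n,m\in\mathbb N_0\Big\}\ =\ \Big\{\frac{1}{1+2^{z}} : z\in\mathbb Z\Big\}\ =\ \Big\{...,\frac19,\frac15,\frac13,\frac12,\frac23,\frac45,\frac89,...\Big\}$$

we choose $\mu(1|x_{<t}) = 1-\mu(0|x_{<t}) = \frac{5}{12}$, which implies $|m_{norm}(x_t|x_{<t})-\mu(x_t|x_{<t})| \ge \frac{1}{12}\ \forall t,\ \forall x_{1:\infty}$.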
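The gaps claimed for the binary case can be verified with exact rational arithmetic. The following sketch assumes the values $\mu(1|x_{<t})=\frac38$ (unnormalized case) and $\mu(1|x_{<t})=\frac{5}{12}$ (normalized case); the ranges are truncated to finitely many points, which is harmless because the omitted points lie even farther away. The names `RANGE_M`, `RANGE_M_NORM` and `gap` are illustrative.

```python
from fractions import Fraction as F

# Truncated range of m: {1, 1/2, 1/4, ...}
RANGE_M = [F(1, 2 ** k) for k in range(40)]
# Truncated range of m_norm: {1/(1+2^z)} for z >= 0 and z < 0
RANGE_M_NORM = ([F(1, 1 + 2 ** z) for z in range(40)]
                + [F(2 ** z, 1 + 2 ** z) for z in range(1, 40)])


def gap(p, values):
    """Distance from p to the nearest point of a finite set of values."""
    return min(abs(p - v) for v in values)


print(gap(F(3, 8), RANGE_M), gap(F(5, 8), RANGE_M))
print(gap(F(5, 12), RANGE_M_NORM), gap(F(7, 12), RANGE_M_NORM))
```

Both the probability of outcome 1 and the complementary probability of outcome 0 are checked, since the bound is claimed for every $x_t$.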
Consider now a general alphabet $\mathcal X = \{1,...,|\mathcal X|\}$, and the unnormalized $m$ first. If $|\mathcal X|$ is not a power of 2 we define $\mu(x_t|x_{<t}) = |\mathcal X|^{-1}$. If $|\mathcal X|$ is a power of 2 we define $\mu(x_t|x_{<t}) = \frac43|\mathcal X|^{-1}$ for even $x_t$ and $\mu(x_t|x_{<t}) = \frac23|\mathcal X|^{-1}$ for odd $x_t$. $\mu$ is a measure with $\mu(x_t|x_{<t}) \notin 2^{-\mathbb N_0}$, but $m(x_t|x_{<t}) \in 2^{-\mathbb N_0}$. The only cluster^9 point of $2^{-\mathbb N_0}$ is 0. Since $0\ne\mu\notin 2^{-\mathbb N_0}$, there exists $\gamma>0$ such that $(\mu-\gamma,\mu+\gamma)\cap 2^{-\mathbb N_0} = \{\}$, hence $|m(x_t|x_{<t})-\mu(x_t|x_{<t})| \ge \gamma\ \forall t,\ \forall x_{1:\infty}$ for some $\gamma>0$.

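For $m_{norm}$ we proceed as follows: With $z_i := Km(1|x_{<t}) - Km(i|x_{<t}) \in \mathbb Z$ we have $m_{norm}(1|x_{<t})^{-1} = 1+\sum_{i=2}^{|\mathcal X|}2^{z_i}$. We define $S := \{1+m_2+...+m_{|\mathcal X|} : m_i\in 2^{\mathbb Z}\cup\{0\}\ \forall i\} \not\ni 0$ and $X := S^{-1} = \{x^{-1} : x\in S\}$. By construction, $m_{norm}(1|x_{<t})\in X$, and by symmetry also $m_{norm}(x_t|x_{<t})\in X$. The cross product $X^{|\mathcal X|} := X\times...\times X$ ($|\mathcal X|$ times) is a closed and countable set, since $2^{\mathbb Z}\cup\{0\}$ is closed and countable, and finite sums, inversions, and cross products of closed/countable sets are closed/countable.^10 With $\Delta := \{v\in\mathbb R^{|\mathcal X|} : 0<v_i<1,\ \sum_{i=1}^{|\mathcal X|}v_i = 1\}$ being the open $|\mathcal X|-1$ dimensional simplex, we have $m_{norm}(\cdot|x_{<t}) \in X^{|\mathcal X|}\cap\Delta$ (e.g. $X^2\cap\Delta = \{(\frac{1}{1+2^z},\frac{1}{1+2^{-z}}) : z\in\mathbb Z\}$). Since $\Delta\setminus X^{|\mathcal X|}$ is open and nonempty (due to countability of $X^{|\mathcal X|}$), there exists $\mu(\cdot|x_{<t}) \in \Delta\setminus X^{|\mathcal X|}$ and a Box $:= \{v : |v_i-\mu(i|x_{<t})| < \gamma\}$ of sufficiently small size $\gamma>0$ surrounding $\mu$, such that Box $\cap\, X^{|\mathcal X|} = \{\}$, which implies the desired result $|m_{norm}(x_t|x_{<t})-\mu(x_t|x_{<t})| \ge \gamma$.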

Remark. There is an easy proof for the weaker statement $m_{norm}(x'_t|x_{<t}) \not\to \mu(x'_t|x_{<t})$, where $x'_t$ may be off-sequence: For $\mu(0|x_{<t}) = \frac14 = 1-\mu(1|x_{<t})$ we have $\frac{\mu(1|x_{<t})}{\mu(0|x_{<t})} = 3 \notin 2^{\mathbb Z}$, while $\frac{m_{norm}(1|x_{<t})}{m_{norm}(0|x_{<t})} \in 2^{\mathbb Z}$. This implies that the posterior of $m_{norm}$ cannot be too close to the posterior of $\mu$ for all $x'_t$, i.e. $\exists x'_t$ and $c>0$ : $|m_{norm}(x'_t|x_{<t})-\mu(x'_t|x_{<t})| \ge c$ ($c=\frac{1}{20}$ possible). One advantage of this proof is that it also goes through for infinite alphabet $\mathcal X$. □

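6.7 Self-optimization of $m = 2^{-Km}$

Theorem 6 (vii) (Self-optimization of $m = 2^{-Km}$)

(1) $l_t^{\Lambda_m}(x_{<t}) \to l_t^{\Lambda_\omega} := \min_{y_t}\ell_{x_t y_t}$ if $\omega\equiv x_{1:\infty}$ is computable.

(2) $\Lambda_m = \Lambda_{m_{norm}}$, i.e. $y_t^{\Lambda_m} = y_t^{\Lambda_{m_{norm}}}$ and $l_t^{\Lambda_m} = l_t^{\Lambda_{m_{norm}}}$.

¬(3) $\forall|\mathcal Y|>2\ \exists\ell,\mu$ : $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1\ \forall t$ ($c=\frac65-\varepsilon$ possible).

¬(4) $\exists\ell,\mu$ : $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1$ for many $t$ with $\mu$-probability $\ge\frac12$ ($c=\sqrt2-\varepsilon$ possible).

¬(5) $\forall$ non-degenerate^11 $\ell\ \exists U,\mu$ : $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \not\to 1$ with high probability.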

Remarks. Since $(vi)$ implies $(vii_1)$ by continuity, we have convergence of the instantaneous losses for computable environments $x_{1:\infty}$, but since convergence off-sequence is potentially slow, the convergence of the losses to optimum is potentially slow.

^9 A point $p\in\mathbb R^n$ is called a cluster point of a set $S\subseteq\mathbb R^n$, if every open set of $\mathbb R^n$ which contains $p$, intersects $S$.

^10 W.r.t. standard topology on $\mathbb R^n$.

^11 A formal definition of non-degenerate is given in the remarks after the theorem.
Non-convergence $\neg(vi_6)$ in probabilistic environments does not necessarily imply that $\Lambda_m$ is not self-optimizing, since different predictive functions can lead to the same predictor $\Lambda$. But $\neg(vii_4)$ shows that $\Lambda_m$ is not self-optimizing even in Bernoulli environments $\mu$ for particular losses $\ell$ with probability $\ge\frac12$.

Interestingly, excluding binary action alphabets allows for a stronger for-sure statement $\neg(vii_3)$.

In $\neg(vii_5)$, non-self-optimization is shown for any non-degenerate loss function (especially for the error loss), for specific choices of the universal Turing machine $U$. Loss $\ell$ is defined to be non-degenerate iff $\bigcap_{x\in\mathcal X}\{\tilde y : \ell_{x\tilde y} = \min_y\ell_{xy}\} = \{\}$. Assume the contrary that a single action $\tilde y$ is optimal for every outcome $x$, i.e. that ($\arg\min_y$ can be chosen such that) $\arg\min_y\ell_{xy} = \tilde y\ \forall x$. This implies $y_t^{\Lambda_\rho} = \tilde y\ \forall\rho$, which implies $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \equiv 1$. So the non-degeneracy assumption is necessary (and sufficient).

Proof (viii) follows from (wii&s) and Theorem |211. 

$(vii_2)$ That normalization does not affect the predictor follows from the definition of $y_t^{\Lambda_\rho}$ and the fact that $\arg\min()$ is not affected by scaling its argument.
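The scaling argument can be made concrete with a few lines of code. This is an illustrative sketch only: the weights stand for arbitrary unnormalized values $m(x_t|x_{<t})$, the loss matrix is made up, and `best_action` and `normalize` are hypothetical helper names.

```python
from fractions import Fraction as F


def best_action(weights, loss):
    """argmin_y sum_x weights[x] * loss[x][y]; weights need not sum to one."""
    n_actions = len(loss[0])
    return min(range(n_actions),
               key=lambda y: sum(w * loss[x][y] for x, w in enumerate(weights)))


def normalize(weights):
    """Divide all weights by their sum, as in the definition of m_norm."""
    total = sum(weights)
    return [w / total for w in weights]


loss = [[F(0), F(1, 2), F(1)],      # loss[x][y] for outcome x = 0
        [F(1), F(1, 3), F(0)]]      # loss[x][y] for outcome x = 1
weights = [F(1, 4), F(1, 8)]        # made-up unnormalized predictive weights
print(best_action(weights, loss), best_action(normalize(weights), loss))
```

Dividing every weight by the same positive constant rescales all expected losses by that constant, so the minimizing action is unchanged.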

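$\neg(vii_3)$ Non-convergence of $m$ does not necessarily imply non-convergence of the losses. For instance, for $\mathcal X = \mathcal Y = \{0,1\}$, and $\omega'_t := 1/0$ for $\mu(1|x_{<t}) \gtrless \gamma := \frac{\ell_{01}-\ell_{00}}{\ell_{01}-\ell_{00}+\ell_{10}-\ell_{11}}$, one can show that $y_t^{\Lambda_\mu} = y_t^{\Lambda_{\omega'}}$, hence convergence of $m(x_t|x_{<t})$ to 0/1 and not to $\mu(x_t|x_{<t})$ could nevertheless lead to correct predictions.

Consider now $x\in\mathcal X = \{0,1\}$, $y\in\mathcal Y = \{0,1,2\}$. To prove $\neg(vii_3)$ we define a loss function such that $y_t^{\Lambda_\mu} \ne y_t^{\Lambda_\rho}$ for any $\rho$ with same range as $m_{norm}$ and for some $\mu$. The loss function $\ell_{x0} = x$, $\ell_{x1} = \frac38$, $\ell_{x2} = \frac23(1-x)$, and $\mu := \mu(1|x_{<t}) = \frac25$ will do. The $\rho$-expected loss under action $y$ is $l_\rho^y := \sum_{x_t=0}^{1}\rho(x_t|x_{<t})\ell_{x_t y}$; $l_\rho^0 = \rho$, $l_\rho^1 = \frac38$, $l_\rho^2 = \frac23(1-\rho)$ with $\rho := \rho(1|x_{<t})$ (see Figure 1). Since $l_\mu^0 = l_\mu^2 = \frac25 > \frac38 = l_\mu^1$, we have $y_t^{\Lambda_\mu} = 1$ and $l_t^{\Lambda_\mu} = l_\mu^1 = \frac38$. For $\rho\le\frac13$, we have $l_\rho^0 < l_\rho^1 < l_\rho^2$, hence $y_t^{\Lambda_\rho} = 0$ and $l_t^{\Lambda_\rho} = l_\mu^0 = \frac25$. For $\rho\ge\frac12$, we have $l_\rho^2 < l_\rho^1 < l_\rho^0$, hence $y_t^{\Lambda_\rho} = 2$ and $l_t^{\Lambda_\rho} = l_\mu^2 = \frac25$. Since $m_{norm}\notin(\frac13,\frac12)$, $\Lambda_{m_{norm}}$ predicts 0 or 2, hence $l_t^{\Lambda_{m_{norm}}} = l_\mu^{0/2} = \frac25$. Since $\Lambda_{m_{norm}} = \Lambda_m$, this shows that $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = \frac{16}{15} > 1$. The constant can be enlarged to $\frac65-\varepsilon$ by setting $\ell_{x1} = \frac13+\varepsilon$ instead of $\frac38$.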
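The three-action construction can be checked mechanically. The sketch below assumes the concrete values $\ell_{x0}=x$, $\ell_{x1}=\frac38$, $\ell_{x2}=\frac23(1-x)$ and $\mu(1|x_{<t})=\frac25$ (any $\ell_{x1}$ strictly between $\frac13$ and $\frac25$ would serve the same purpose), restricts the range $\{\frac{1}{1+2^z}\}$ of $m_{norm}$ to finitely many points, and uses illustrative names throughout.

```python
from fractions import Fraction as F

# LOSS[x][y]: loss of action y in {0, 1, 2} when the outcome is x in {0, 1}
LOSS = {0: (F(0), F(3, 8), F(2, 3)),
        1: (F(1), F(3, 8), F(0))}
MU = F(2, 5)                      # true probability of outcome 1

# finitely many points of the range of m_norm(1|.), i.e. {1/(1+2^z) : z integer}
M_NORM_RANGE = ([F(1, 1 + 2 ** z) for z in range(30)]
                + [F(2 ** z, 1 + 2 ** z) for z in range(1, 30)])


def expected_losses(p1):
    """Expected loss of each action if outcome 1 has probability p1."""
    return [(1 - p1) * LOSS[0][y] + p1 * LOSS[1][y] for y in range(3)]


def action(p1):
    """Action minimizing the p1-expected loss (first minimizer on ties)."""
    losses = expected_losses(p1)
    return losses.index(min(losses))


def loss_ratio():
    """mu-expected loss of actions 0 or 2, relative to the mu-optimal action."""
    mu_losses = expected_losses(MU)
    return min(mu_losses[0], mu_losses[2]) / mu_losses[action(MU)]


print(action(MU), sorted({action(p) for p in M_NORM_RANGE}), loss_ratio())
```

The informed predictor picks the middle action, whereas every admissible value of $m_{norm}$ leads to one of the two outer actions, which is what drives the loss ratio above 1.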

For $\mathcal Y = \{0,...,|\mathcal Y|-1\}$, $|\mathcal Y|>3$, we extend the loss function by defining $\ell_{xy} = 1\ \forall y\ge 3$, ensuring that actions $y>2$ are never favored. For $\mathcal X = \{0,...,|\mathcal X|-1\}$, $|\mathcal X|>2$, we extend $\mu$ and define $\mu(x_t|x_{<t}) = 0\ \forall x_t\ge 2$. Furthermore, we define $\ell_{xy} = 0$ for $x\ge 2$ and $y<3$. This ensures that the extra components of $m_{norm}(x_t|x_{<t})$ with $x_t\ge 2$ do not contribute to $l^y_{m_{norm}}$. Finally, and this is important, we define, solely for the purpose of this proof, $m_{norm}(x_t|x_{<t}) := \frac{m(x_t|x_{<t})}{m(0|x_{<t})+m(1|x_{<t})}$ such that $m_{norm}(0|x_{<t})+m_{norm}(1|x_{<t}) = 1$ (rather than $\sum_{x_t=0}^{|\mathcal X|-1}m_{norm}(x_t|x_{<t}) = 1$). (Normalization influences the analysis, but not the result.) With these extensions, the analysis of the $|\mathcal X|=2$, $|\mathcal Y|=3$ case applies, which finally shows $\neg(vii_3)$. In general, a non-dense range of $\rho(x_t|x_{<t})$ implies $l_t^{\Lambda_\rho} \not\to l_t^{\Lambda_\mu}$, provided $|\mathcal Y|\ge 3$.

Figure 1 (Example loss used in proof of Theorem 6 $\neg(vii_3)$) The $\rho$-expected losses $l_\rho^y$ under actions $y\in\mathcal Y=\{0,1,2\}$ for $\mathcal X=\{0,1\}$ and loss function $\ell_{00}=\ell_{12}=0$, $\ell_{01}=\ell_{11}=\frac38$, $\ell_{02}=\frac23$, and $\ell_{10}=1$ are displayed as solid lines.

$\neg(vii_4)$ We consider binary $\mathcal X = \mathcal Y = \{0,1\}$ first. The proof idea and notation is similar to $\neg(vii_3)$. We choose a $\mu := \mu(1|x_{<t}) \notin \{\frac{1}{1+2^z} : z\in\mathbb Z\}$. Let $a,b\in\{\frac{1}{1+2^z} : z\in\mathbb Z\}$ with $a<\mu<b$ be the nearest (to $\mu$) possible values of $m_{norm}\in\{\frac{1}{1+2^z}\}$. For a fixed sequence $x_{1:\infty}$, we have either $m_{norm}(1|x_{<t})\le a$ for (infinitely) many $t$ or $m_{norm}(1|x_{<t})\ge b$ for (infinitely) many $t$ (or both). Choosing $x_{1:\infty}$ at random, we have either $m_{norm}(1|x_{<t})\le a$ for many $t$ with $\mu$-probability $\ge\frac12$ or $m_{norm}(1|x_{<t})\ge b$ for many $t$ with $\mu$-probability $\ge\frac12$ (or both). Assume the former; for the latter the proof is analogous. We consider a loss function such that $l_\mu^0 > l_\mu^1$ and $l_a^0 < l_a^1$. Then also $l_m^0 < l_m^1$ whenever $m\le a$, which is the case for many $t$ by assumption. Hence $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = l_\mu^0/l_\mu^1 = c > 1$. For instance, choose $\mu = \sqrt2-1$ and $\ell_{00} = 0$ and $\ell_{10} = 1$ ($\Rightarrow l_\rho^0 = \rho$). We get $c = \sqrt2 - O(\varepsilon)$ by choosing $\ell_{01} = \frac12+\varepsilon$ and $\ell_{11} = 0$ ($\Rightarrow l_\rho^1 = (\frac12+\varepsilon)(1-\rho)$) in the former case with $a=\frac13$ (and $\ell_{01} = 1-\varepsilon$ and $\ell_{11} = 0$ ($\Rightarrow l_\rho^1 = (1-\varepsilon)(1-\rho)$) in the latter case with $b=\frac12$ and $l_\mu^0 < l_\mu^1$ and $l_b^0 > l_b^1$). The generalization to general $\mathcal X$ and $\mathcal Y$ can be performed similarly to $\neg(vii_3)$.
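The numbers in the binary example can be reproduced directly. The following is a small numeric sketch for the first of the two cases, assuming $\mu=\sqrt2-1$, $a=\frac13$, $\ell_{00}=\ell_{11}=0$, $\ell_{10}=1$ and $\ell_{01}=\frac12+\varepsilon$; the function name and the default value of `eps` are illustrative choices.

```python
import math


def loss_ratio(eps=1e-3):
    """Ratio l_mu^0 / l_mu^1 for the loss with l_01 = 1/2 + eps, l_10 = 1."""
    mu = math.sqrt(2) - 1          # true probability of outcome 1
    a = 1 / 3                      # nearest value of m_norm below mu

    def l0(p):                     # expected loss of action 0: p * l_10
        return p

    def l1(p):                     # expected loss of action 1: (1 - p) * l_01
        return (0.5 + eps) * (1 - p)

    assert l0(mu) > l1(mu)         # the informed predictor chooses action 1
    assert l0(a) < l1(a)           # a predictor with p <= a chooses action 0
    return l0(mu) / l1(mu)


print(loss_ratio(), math.sqrt(2))
```

As `eps` shrinks, the ratio approaches $\sqrt2$ from below, in line with the constant $c=\sqrt2-O(\varepsilon)$ stated in the proof.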

$\neg(vii_5)$ We first present a simple proof for a particular loss function and $\mathcal X = \mathcal Y = \{0,1\}$, which contains the main idea also used to prove the general result. We define a monotone Turing machine $U$ by $U(1x0) = x0$ for all $x\in\mathcal X^*$. More precisely, if the first bit of the input tape of $U$ contains 1, $U$ copies the half-infinite input tape (without the first 1) to the output tape, but always withholds the output until a 0 appears. We have $Km(x1) = Km(x10) = \ell(x)+3 = Km(x0)+1$, which implies $m_{norm}(1|x) = \frac12 m_{norm}(0|x) = \frac13$. For the loss function $\ell_{00} = \ell_{11} = 0$, $\ell_{10} = 1$, $\ell_{01} = \frac34$ and a Bernoulli($\frac12$) process $\mu(x_t|x_{<t}) = \frac12$ we get $l^0_{m_{norm}} = \frac13\cdot 1 < \frac12 = l^1_{m_{norm}}$ and $l^0_\mu = \frac12\cdot 1 > \frac38 = l^1_\mu$, hence $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = l_\mu^0/l_\mu^1 = \frac43$. $U$ is not universal. We make $U$ universal by additionally
defining $U(0^s p) = U'(p)$ for some (large, but reasonable) $s\in\mathbb N$ and some (other) universal monotone TM $U'$. We have to check whether this can alter (lower) the monotone complexity. Fix $n$. Every $x$ of length $n$ has description $1x0$ of length $n+2$, so $U'$ only matters if $U'(p) = x*$ for some $p$ of length $\le n-s+1$. Since there are at most $2^{n-s}$ minimal programs of length $\le n-s$, the fraction of problematic $x$ is at most $2^{-s}$. Since $x$ is drawn at random, the above loss ratio $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} > 1$ hence holds with high probability ($\ge 1-2^{-s}$). A martingale argument (see below) shows that this implies $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \not\to 1$ (w.h.p.).
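The complexities claimed for the simple machine $U(1x0)=x0$ can be confirmed by brute force. This sketch covers only the non-universal part of the construction: the branch through $U'$ is omitted (programs not starting with 1 print nothing), complexities are found by exhaustive search over short programs, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def U(p):
    """Toy monotone machine: after a leading 1, copy the input, but release
    output only up to (and including) the last 0 read so far."""
    if not p.startswith("1"):
        return ""                  # the universal branch U' is omitted here
    z = p[1:]
    return z[: z.rfind("0") + 1]   # rfind gives -1 (empty output) if no 0 was read


def Km(x, max_len=12):
    """Length of a shortest program whose output starts with x (brute force)."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            if U("".join(bits)).startswith(x):
                return n
    raise ValueError("no program found up to max_len")


def m_norm_next(x):
    """Normalized predictive weights (m_norm(0|x), m_norm(1|x))."""
    w0 = Fraction(1, 2 ** Km(x + "0"))
    w1 = Fraction(1, 2 ** Km(x + "1"))
    return w0 / (w0 + w1), w1 / (w0 + w1)


x = "011"
print(Km(x + "0"), Km(x + "1"), m_norm_next(x))
```

Continuing with a 1 costs exactly one more bit than continuing with a 0, because the output is withheld until a further 0 is supplied; this asymmetry fixes the normalized weights at $\frac23$ and $\frac13$ regardless of $x$.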

We now consider the case of general loss and alphabets. In case where ambiguities in the choice of $y$ in $\arg\min_y\ell_{xy}$ matter, we consider the set of solutions $\{\arg\min_y\ell_{xy}\} := \{\tilde y : \ell_{x\tilde y} = \min_y\ell_{xy}\} \ne \{\}$. By assumption, $\ell$ is non-degenerate, i.e. $\bigcap_{x\in\mathcal X}\{\arg\min_y\ell_{xy}\} = \{\}$. Let $\mathcal X_m$ be a minimal subset of $\mathcal X$ with $\bigcap_{x\in\mathcal X_m}\{\arg\min_y\ell_{xy}\} = \{\}$. Take any decomposition $\mathcal X_0\,\dot\cup\,\mathcal X_1 = \mathcal X_m$ with $\mathcal X_0 \ne \{\} \ne \mathcal X_1$, which is possible, since $|\mathcal X_m|\ge 2$. We have $\mathcal Y_i := \bigcap_{x\in\mathcal X_i}\{\arg\min_y\ell_{xy}\} \ne \{\}$, since $\mathcal X_m$ is minimal. Further, $\mathcal Y_0\cap\mathcal Y_1 = \mathcal Y_m = \{\}$. It is convenient to choose $|\mathcal X_1| = 1$. W.l.g. we assume $\mathcal X_1 = \{1\}$.

Define some $Q\subset\{0,1\}^s$, $|Q| = |\mathcal X_0|$, a bijection $b: Q\to\mathcal X_0$, and a one-to-one (onto $A$) decoding function $d:\{0,1\}^s\to A$ with $A := \mathcal X_0 1^s\cup 1\{0,1\}^s\setminus 1Q \subset \mathcal X^{s+1}$ as $d(x) = b(x)1^s$ for $x\in Q$ and $d(x) = 1x$ for $x\in\{0,1\}^s\setminus Q$, with a large $s\in\mathbb N$ to be determined later. We extend $d$ to $d: (\{0,1\}^s)^*\to A^*$ by defining $d(z_1...z_k) = d(z_1)...d(z_k)$ for $z_i\in\{0,1\}^s$, and define the inverse coding function $c: A\to\{0,1\}^s$ and its extension $c: A^*\to(\{0,1\}^s)^*$ by $c = d^{-1}$.

Roughly, $U$ is defined as $U(1p_{1:sn}q) = d(p_{1:sn})b(q)1^s$ for $q\in Q$. More precisely, if the first bit of the binary input tape of $U$ contains 1, $U$ decodes the successive blocks of size $s$, but always withholds the output until a block $q\in Q$ appears. $U$ is obviously monotone. Universality will be guaranteed by defining $U(0p)$ appropriately, but for the moment we set $U(0p) = \epsilon$. It is easy to see that for $x\in A^*$ we have

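$$Km(xx_0) = Km(xx_0 1^s) = \ell(c(x))+s+1 \quad\text{for } x_0\in\mathcal X_0,$$
$$Km(x1) = Km(x1z\,b(q)1^s) = \ell(c(x))+2s+1 \quad\text{for any } z\in\{0,1\}^s\setminus Q,\ q\in Q,$$
$$Km(xy) = \infty \quad\text{for any } y\in\mathcal X\setminus(\mathcal X_0\cup\{1\}).$$

Hence, $m_{norm}(x_0|x) = [|\mathcal X_0|+2^{-s}]^{-1} \to |\mathcal X_0|^{-1}$ and $m_{norm}(1|x) = [2^s|\mathcal X_0|+1]^{-1} \to 0$ for $s\to\infty$, and $m_{norm}(y|x) = 0$. For $t-1\in(s+1)\mathbb N$ we get $l^{y_t}_{m_{norm}} := \sum_{x_t}m_{norm}(x_t|x_{<t})\ell_{x_t y_t} \to \frac{1}{|\mathcal X_0|}\sum_{x_t\in\mathcal X_0}\ell_{x_t y_t}$ for $s\to\infty$. This implies

$$y_t^{\Lambda_m} \in \{\arg\min_{y_t} l^{y_t}_{m_{norm}}\} \subseteq \{\arg\min_{y}\textstyle\sum_{x\in\mathcal X_0}\ell_{xy}\} = \bigcap_{x\in\mathcal X_0}\{\arg\min_y\ell_{xy}\} = \mathcal Y_0. \qquad (11)$$

Inclusion $\subseteq$ holds for sufficiently large finite $s$. Equality $=$ holds, since the set of points which are global minima of a linear average of functions coincides with the set of points which simultaneously minimize all these functions, if the latter is nonempty.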

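We now define $\mu(z) = |A|^{-1} = 2^{-s}$ for $z\in A$ and $\mu(z) = 0$ for $z\in\mathcal X^{s+1}\setminus A$, extend it to $\mu(z_1...z_k) := \mu(z_1)\cdot...\cdot\mu(z_k)$ for $z_i\in\mathcal X^{s+1}$, and finally extend it uniquely to a measure on $\mathcal X^*$ by $\mu(x_{<t}) := \sum_{x_{t:n}}\mu(x_{1:n})$ for $\mathbb N\ni t\le n\in(s+1)\mathbb N$. For $x\in A^*$ we have $\mu(x_0|x) = \mu(x_0) = \mu(x_0 1^s) = 2^{-s}$ and $\mu(1|x) = \mu(1) = \sum_{y\in\mathcal X^s}\mu(1y) = \sum_{y\in\{0,1\}^s\setminus Q}\mu(1y) = (2^s-|Q|)\cdot 2^{-s} = 1-|\mathcal X_0|2^{-s} \to 1$ for $s\to\infty$. For $t-1\in(s+1)\mathbb N$ we get $l_\mu^{y_t} := \sum_{x_t}\mu(x_t|x_{<t})\ell_{x_t y_t} \to \ell_{1y_t}$ for $s\to\infty$. This implies

$$y_t^{\Lambda_\mu} \in \{\arg\min_{y_t} l_\mu^{y_t}\} \subseteq \{\arg\min_y\ell_{1y}\} = \mathcal Y_1 \quad\text{for sufficiently large finite } s. \qquad (12)$$

Since $\mathcal Y_0\cap\mathcal Y_1 = \{\}$, (11) and (12) imply $y_t^{\Lambda_m} \ne y_t^{\Lambda_\mu}$, which implies $l_t^{\Lambda_m} \ne l_t^{\Lambda_\mu}$ (otherwise the choice $y_t^{\Lambda_m} = y_t^{\Lambda_\mu}$ would have been possible), which implies $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1$ for $t-1\in(s+1)\mathbb N$, i.e. for (infinitely) many $t$.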

What remains to do is to extend $U$ to a universal Turing machine. We extend $U$ by defining $U(0zp) = U'(p)$ for any $z\in\{0,1\}^{3s}$, where $U'$ is some universal Turing machine. Clearly, $U$ is now universal. We have to show that this extension does not spoil the preceding consideration, i.e. that the shortest code of $x$ has sufficiently often the form $1p$ and sufficiently seldom the form $0p$. Above, $\mu$ has been chosen in such a way that $c(x)$ is a Shannon-Fano code for $\mu$-distributed strings, i.e. $c(x)$ is with high $\mu$-probability a shortest code of $x$. More precisely, $\ell(c(x)) \le Km_T(x)+s$ with $\mu$-probability at least $1-2^{-s}$, where $Km_T$ is the monotone complexity w.r.t. any decoder $T$, especially $T = U'$. This implies $\min_p\{\ell(0p) : U(0p) = x*\} = 3s+1+Km_{U'}(x) \ge 3s+1+\ell(c(x))-s \ge \ell(c(x))+s+1 \ge \min_p\{\ell(1p) : U(1p) = x*\}$, where the first $\ge$ holds with high probability ($1-2^{-s}$) and the last $\ge$ holds with $\mu$-probability 1. This shows that the above expressions for $Km$ are with high probability (w.h.p.) not affected by the extension of $U$. Altogether this shows $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1$ w.h.p.

A martingale argument can strengthen this result to yield non-selfoptimizingness. For $z_t := \frac{M(\omega_{1:t})}{\mu(\omega_{1:t})}$ we have $z_0 = 1$, $E[z_t]\le 1$, and $E[z_t|\omega_{<t}] \le z_{t-1}$, hence $-z_t$ is a non-positive semi-martingale. [Doo53, Thm.4.1s, p324] now implies that $z_\infty := \lim_{t\to\infty}z_t$ exists w.p.1 and $E[z_\infty] \le \lim_{t\to\infty}E[z_t] \le 1$. The Markov inequality now yields

$$P[\lim_{t\to\infty}(KM(\omega_{1:t})+\log\mu(\omega_{1:t})) \le -s]\ =\ P[z_\infty\ge 2^s]\ \le\ 2^{-s}E[z_\infty]\ \le\ 2^{-s}.$$

Substituting $KM \le Km \le Km_{U'}$ and $-\log\mu(x) = \ell(c(x))$, this shows that $\ell(c(\omega_{1:t})) \le Km_{U'}(\omega_{1:t})+s$ for almost all $t\in(s+1)\mathbb N$ with probability $\ge 1-2^{-s}$. Altogether this shows $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \not\to 1$ w.h.p. □



7 Further Results 



Predictive Properties of $k = 2^{-K}$. We briefly discuss the predictive properties of the prefix Kolmogorov complexity $K$. We will be very brief, since $K$ completely fails for predictive purposes, although $K$ is close to $KM$ within an additive logarithmic term.

Theorem 7 (Properties of $k = 2^{-K}$) For $b = k = 2^{-K}$, where $K$ is the prefix Kolmogorov complexity, the following properties are satisfied/violated: (o) $KM(x) \le K(x) \le KM(x)+2\log K(x)$. (i), (ii), (iii) are violated. (iv) is satisfied only for $k(x|y) := k(yx)/k(y)$. For $k_| = 2^{-K_|}$, (iv) is only satisfied to logarithmic order. In any case (vi) and (vii) can be violated for deterministic as well as probabilistic $\mu\in\mathcal M_{comp}^{msr}$. (v) is satisfied.

Proof sketch. (o) Similar to proof of Theorem 3.4 in [ZL70]. ¬(i) For deterministic $\mu\in\mathcal M_{comp}^{msr}$ with $\mu(0_{1:n}) = 1$, we have $k(0_{1:n}) \to 0 \ne 1 = \mu(0_{1:n})$, since $K(\omega_{1:n}) \to \infty\ \forall\omega$. ¬(ii) holds, since $K(0_{1:n}) \stackrel{+}{=} K(n) \ge \log n$ for most $n$, but $\le 2\log\log n$ for $n$ being a power of 2. ¬(ii) implies ¬(iii). (iv) within log follows from [LV97, Thm.3.9.1]. ¬(iv), since it does not even hold within an additive constant (see [LV97, p231]). (v) is immediate from the definition. ¬(vii) Define a universal prefix Turing machine $U$ via some other universal prefix Turing machine $U'$ by $U(00p) = U'(p)0$, $U(1p) = U'(p)1$, $U(01) = \epsilon$. For this $U$ we have $K(x0) = K(x1)+1\ \forall x$ ($K = K_U$), which implies that $\Lambda_k$ for the error loss always predicts 1. ¬(vi) follows from ¬(vii). □

Also, $K(x|\ell(x))$ is a poor predictor, since $K(x0|\ell(x0)) \stackrel{+}{=} K(x1|\ell(x1))$, and the additive constant can be chosen to one's need by an appropriate choice of $U$. Note that the larger a semimeasure, the more distributions it dominates, and the better its predictive properties. This simple rule does not hold for non-semimeasures. Although $M$ predicts better than $m$, which predicts better than $k$, in accordance with their ordering by size, $2^{-K(x|\ell(x))} \stackrel{\times}{\ge} M(x)$ is a bad predictor, in disaccord with this ordering.

Simple MDL. There are other ways than $m$ of using shortest programs for predictions. We have chosen the (in our opinion) most natural and promising way. A somewhat simpler version of MDL is to take the shortest (nonhalting) program $p$ which outputs $x$, continue running $p$, and use the continuation $y$ of $x$ for prediction:

$\tilde m_|(x_t|x_{<t}) := 1$ if the shortest program for $x_{<t}*$ computes $x_{<t}x_t*$, and $\tilde m_|(\bar x_t|x_{<t}) := 0$ otherwise.



Theorem 8 (Properties of $\tilde m$) For the simple MDL predictor $\tilde m_|(x_t|x_{<t})$ and $\tilde m(x_{1:n}) := \prod_{t=1}^{n}\tilde m_|(x_t|x_{<t})$, the following holds: $\tilde m$ is a deterministic, (ii) monotone, (iii) measure, satisfying (iv) the chain rule (by definition), is ¬(i) not universal w.r.t. $\mathcal M_{comp}^{msr}\cap\mathcal M_{det}$, is ¬(v) not enumerable, and is ¬(vi) not convergent and ¬(vii) not self-optimizing w.r.t. some $\mu\in\mathcal M_{comp}^{msr}$.

Note that $\tilde m_|$ contains more information than $\tilde m$. $\tilde m_|$ cannot be reconstructed from $\tilde m$, since $\tilde m_|(x'_t|x_{<t})$ is defined even if $\tilde m(x_{<t}) = 0$. ¬(vi) and ¬(vii) follow from non-denseness $\{\tilde m_|\} = \{0,1\}$. For ¬(i) take $\omega = 1^\infty$ in case $\tilde m(1) = 0$, and $0^\infty$ otherwise. We did not check the convergence properties for deterministic environments.

Another possibility is to define $m = f(Km)$ with $f$ some monotone decreasing function other than $f(Km) = 2^{-Km}$, since $m = 2^{-Km}$ is not a semimeasure anyway. We do not expect exciting results.



8 Outlook and Open Problems 

Speed of off-sequence convergence of $m$ for computable environments. A more detailed analysis of the speed of convergence of $m(\bar x_t|x_{<t})$ to zero in deterministic environments would be interesting: How close are the off-sequence upper bound $(vi_4)$ $\stackrel{\times}{\le} Km^3$ and the lower bound $\neg(vi_5)$ $\ge 2^s-2$? Can the lower bound be improved to $\stackrel{\times}{\ge} 2^s\cdot Km$? Maybe for the witnesses of $m \stackrel{\times}{\ne} M$? The upper bound can be improved to $\stackrel{\times}{\le} Km^2\cdot\log Km$. Can the bound be improved to $\stackrel{\times}{\le} Km$? Probably the most interesting open question is whether there exist universal Turing machines for which the multiplicative constant is of reasonable size. We expect that these hypothetical TMs, if they exist, are very natural in the sense that they also possess other convenient properties.

Non-self-optimization for general $U$ and $\ell$. Another open problem is whether for every non-degenerate loss function, self-optimization of $\Lambda_m$ can be violated. We have shown that this is the case for particular choices of the universal Turing machine $U$. If $\Lambda_m$ were self-optimizing for some $U$ and general loss, this would be an unusual situation in Algorithmic Information Theory, where properties typically hold for all or no $U$. So we expect $\Lambda_m$ not to be self-optimizing for general loss and $U$ (particular $\mu$ of course). A first step may be to try to prove that for all $U$ there exists a computable sequence $x_{1:\infty}$ such that $K(x_{<t}\bar x_t) < K(x_{<t}x_t)$ for (infinitely) many $t$ (which shows ¬(vii) for $K$ and error loss), and then try to generalize to probabilistic $\mu$, $Km$, and general loss functions.

Other complexity measures. This work analyzed the predictive properties of the monotone complexity $Km$. This choice was motivated by the fact that $m$ is the MDL approximation of the sum $M$, and $Km$ is very close to $KM$. We expect all other (reasonable) alternative complexity measures to perform worse than $Km$. But we should be careful with hasty conclusions: since closeness of unconditional predictive functions does not necessarily imply good prediction performance, distantness may not necessarily imply poor performance. Besides the discussed prefix Kolmogorov complexity $K$ [Lev74, Gac74, Cha75], monotone complexity $Km$ [Lev73a], and Solomonoff's universal prior $M = 2^{-KM}$ [Sol64, Sol78, ZL70], one may investigate the predictive properties of the plain Kolmogorov complexity $C$ [Kol65], process complexity [Sch73], Chaitin's complexity $Kc$ [Cha75], extension semimeasure $Mc$ [Cov74], uniform complexity [Lov69b, Lov69a], cumulative $K^E$ and general $K^G$ complexity and corresponding measures [Sch02a], predictive complexity $KP$ [VW98], speed prior $S$ [Sch02b], Levin complexity [Lev73b, Lev84], and several others. Most of them are described in [LV97]. Many properties and relations are known for the unconditional versions, but little that is relevant for prediction is known for the conditional versions.

Two-part MDL. We have approximated $M(x) := \sum_{p:U(p)=x*}2^{-\ell(p)}$ by its dominant contribution $m(x) = 2^{-Km(x)}$, which we have interpreted as deterministic or one-part universal MDL. There is another representation of $M$ due to Levin [ZL70] as a mixture over semimeasures: $M(x) \stackrel{\times}{=} \sum_{\nu\in\mathcal M_{enum}^{semi}}2^{-K(\nu)}\nu(x)$ with dominant contribution $m_2(x) = 2^{-Km_2(x)}$ and universal two-part MDL $Km_2(x) := \min_{\nu\in\mathcal M_{enum}^{semi}}\{-\log\nu(x)+K(\nu)\}$. MDL "lives" from the validity of this approximation. $K(\nu)$ is the complexity of the probabilistic model $\nu$, and $-\log\nu(x)$ is the (Shannon-Fano) description length of data $x$ in model $\nu$. MDL usually refers to two-part MDL, and not to one-part MDL. A natural question is to ask about the predictive properties of $m_2$, similarly to $m$. $m_2$ is even closer to $M$ than $m$ is ($m_2 \stackrel{\times}{=} M$), but is also not a semimeasure. Drawing the analogy to $m$ further, one may ask whether (slow) posterior convergence $m_2\to\mu$ w.p.1 for computable probabilistic environments holds. In [PH04a, PH04b] we show, more generally, slow posterior convergence of two-part MDL w.p.1 in probabilistic environments $\mu$. See also [BC91] for convergence results for two-part MDL in i.i.d. environments.
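The relation between the mixture and its dominant two-part MDL term can be illustrated on a tiny, finite model class. The sketch below is a toy stand-in only: the class of five Bernoulli models and their code lengths (playing the role of $K(\nu)$) are hypothetical choices made for this example, not the class of all enumerable semimeasures, and the function names are illustrative.

```python
import math
from fractions import Fraction as F

# Hypothetical model class: Bernoulli(theta), theta = probability of a 1,
# with made-up code lengths in bits standing in for K(nu).
MODELS = {F(1, 2): 1, F(1, 4): 3, F(3, 4): 3, F(1, 8): 4, F(7, 8): 4}


def likelihood(theta, x):
    """nu(x) for an i.i.d. Bernoulli(theta) model and a binary string x."""
    ones = x.count("1")
    return theta ** ones * (1 - theta) ** (len(x) - ones)


def mixture(x):
    """Mixture sum_nu 2^-K(nu) nu(x) over the toy class."""
    return sum(F(1, 2 ** k) * likelihood(th, x) for th, k in MODELS.items())


def m2(x):
    """Dominant term max_nu 2^-K(nu) nu(x) = 2^-Km2(x)."""
    return max(F(1, 2 ** k) * likelihood(th, x) for th, k in MODELS.items())


def Km2(x):
    """Two-part code length min_nu { -log2 nu(x) + K(nu) } in bits."""
    return -math.log2(m2(x))


x = "000"
print(Km2(x), m2(x), mixture(x))
print(m2(x + "0") + m2(x + "1"), "versus", m2(x))
```

In this toy class the dominant term never exceeds the mixture, and the last line shows a string for which the weights of the two continuations add up to more than the weight of the string itself, i.e. a violation of the semimeasure property of the kind mentioned above for $m_2$.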

More abstract proofs, showing that violation of some of the criteria (i)-(iv) necessarily leads to violation of (vi) or (vii), may deal with a number of complexity measures simultaneously. For instance, we have seen that any non-dense posterior set $\{k(x_t|x_{<t})\}$ implies non-convergence and non-self-optimization in probabilistic environments; the particular structure of $m$ did not matter. Maybe a probabilistic version of the theorem on the convergence of universal non-semimeasures is possible under some (mild?) extra assumptions on $b$.

Extra conditions. Non-convergence or non-self-optimization of $m$ does not necessarily mean that $m$ fails in practice. Often one knows more than that the environment is (probabilistically) computable, or the environment possesses certain additional properties, even if unknown. So one should find sufficient and/or necessary extra conditions on $\mu$ under which $m$ converges / self-optimizes rapidly. The results of this work have shown that for $m$-based prediction one has to make extra assumptions (as compared to $M$). It would be interesting to characterize the class of environments for which universal MDL alias $m$ is a good predictive approximation to $M$. Deterministic computable environments were such a class, but a rather small one, and convergence can be slow.